Personalized human health care products and services that enable individuals to more actively manage their health based on their genetic profiles have been increasingly heralded following the publication of a draft human genome sequence in June 2000 (Venter, J C, Funct Integr Genomics. 2000 November; 1(3):154-5) and a substantially complete sequence of the human genome in February 2001 (Venter, J C et al., Science 291(5507):1304-51 [2001]; Lander E S et al., Nature 409(6822):860-921 [2001]). To date, however, the commercial availability of personalized genetic profile products and services has been extremely limited and costly.
The “genome” of an individual member of a species comprises that individual's complete set of genes. Particular locations within the genome of a species are referred to as “loci” or “sites”. “Alleles” are varying forms of the genomic DNA located at a given site. In the case of a site where there are two distinct alleles in a species, referred to as “A” and “B”, each individual member of the species can have one of four possible combinations: AA; AB; BA; and BB. The first allele of each pair is inherited from one parent, and the second from the other.
The “genotype” of an individual at a specific site in the individual's genome refers to the specific combination of alleles that the individual has inherited. A “genetic profile” for an individual includes information about the individual's genotype at a collection of sites in the individual's genome. As such, a genetic profile is comprised of a set of data points, where each data point is the genotype of the individual at a particular site.
Genotype combinations with identical alleles (e.g., AA and BB) at a given site are referred to as “homozygous”; genotype combinations with different alleles (e.g., AB and BA) at that site are referred to as “heterozygous.” It should be noted that standard techniques for determining the alleles in a genome cannot differentiate AB from BA; that is, it is impossible to determine from which parent a certain allele was inherited, given solely the genomic information of the individual tested. Moreover, variant AB parents can pass either variant A or variant B to their children. While such parents may not have a predisposition to develop a disease, their children may. For example, two variant AB parents can have children who are variant AA, variant AB, or variant BB, and one of the two homozygous combinations in this set of three may be associated with a disease. Having advance knowledge of this possibility allows potential parents to make the best possible decisions about their children's health.
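The inheritance pattern described above can be illustrated with a short sketch. It assumes simple Mendelian segregation, in which each parent passes on either of its two alleles with equal probability; the function name `offspring_distribution` is hypothetical and chosen only for this illustration. Genotypes are treated as unordered, reflecting the fact that AB and BA cannot be differentiated.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_distribution(parent1, parent2):
    """Return unordered offspring genotypes with their Mendelian probabilities.

    Each parent is a two-letter string such as "AB". Every pairing of one
    allele from each parent is assumed to be equally likely.
    """
    counts = Counter(
        "".join(sorted(a + b))  # sorting collapses AB and BA into one genotype
        for a, b in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in sorted(counts.items())}


# Two heterozygous (AB) parents: children may be AA, AB, or BB.
for genotype, probability in offspring_distribution("AB", "AB").items():
    print(genotype, probability)
```

Under these assumptions, two AB parents are expected to have AA, AB, and BB children in a 1:2:1 ratio, so a disease associated with one homozygous combination could appear in a child even though neither parent is affected.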
Diseases are often associated with the collection of atoms, molecules, macromolecules, cells, tissues, organs, structures, fluids, metabolic, respiratory, pulmonary, neurological, reproductive or other physiological function, reflexes, behaviors and other physical characteristics observable in the individual through various means. The “phenotype” of an individual refers to one or more of these observable physical characteristics. An individual's phenotype is driven in large part by constituent proteins in the individual's proteome, the collection of all proteins produced by the cells comprising the individual and coded for in the individual's genome.
In many cases, a given phenotype can be associated with a specific genotype. For example, an individual with a certain pair of alleles for the gene that encodes for a particular lipoprotein associated with lipid transport may exhibit a phenotype characterized by a susceptibility to a hyperlipidemous disorder that leads to heart disease.
While efforts have been undertaken to create new personalized active health management products and services based on genetic profiles, several shortcomings characterize the existing art. Among these shortcomings are the following:
    First, the mix of existing products and services is, in the aggregate, narrowly focused on a small set of disease phenotypes, making these products and services inefficient in enabling health management practices that encompass a broad set of phenotypes;
    Second, existing genetic testing products and services are each focused on a genetic indication for one or a small set of diseases;
    Third, until the high cost of sequencing the genome of an individual human declines by several orders of magnitude, an alternative to genome sequencing technology must be used as the basis for genetic profile products and services, and currently available alternatives require substantial modification in order to be integrated into the array of technologies and logistics necessary to provide genetic profile products and services encompassing a comprehensive set of diseases;
    Fourth, existing informatics and database management tools do not scale efficiently or effectively to the dynamic and exponential growth of reported scientific research and clinical findings underlying genetic profile products and services, resulting in a high degree of information obsolescence;
    Fifth, existing genetic profile products and services are designed to be used at key life events, such as disease onset, family disease onset, preconception and prenatal events, and typically by one or more members of a family with an already-known history of a particular disease among its generations, rather than as part of a comprehensive personalized health management program; and
    Sixth, genetic counseling practices, focused on point tests assessed at key life events, must be significantly altered to support the increase in information volume and complexity arising from broad-based genetic profiling.
The objective of personalized genetic profile health management products and services is to provide individuals with information about their predisposition to diseases. Armed with this information, individuals can, in many instances, make decisions about their dietary practices, pharmaceutical use, exercise, and other lifestyle habits that are designed to better manage their predisposition to diseases.
From individual to individual within any species, genes are characterized by a very high degree of conservation in the sequence of nucleotide base pairs comprising them. At certain locations in many sites, however, the specific nucleotides that comprise a gene can undergo alteration, or mutation. Mutations can be inherited from a parent or acquired during a person's life. A hereditary mutation will be present in all of a person's cells and will be passed on to future generations, because the person's reproductive cells (sperm or egg) will contain the mutation. An acquired mutation can arise in the DNA of individual cells as a result of many possible factors. For example, mutations in the DNA of skin cells can be caused by exposure to the sun's UV radiation. Genetic mutations in other cells can arise from errors that occur just prior to cell division, during which a cell makes a copy of its DNA before dividing into two. Genetic profile products and services tend to focus on hereditary mutations.
The situation in which two or more sequence variants of an allele exist at a site across different members of a population is called a “polymorphism,” typically defined as having an occurrence frequency greater than 1% within that population. Several different types of polymorphisms are known in the art. By far the most common polymorphisms are those involving single nucleotide variations between individuals of the same species; such polymorphisms are called “single nucleotide polymorphisms”, or “SNPs”. To date, at least 1.42 million SNPs have been identified in the human genome. (Sachidanandam R et al., Nature 409(6822):928-33 [2001]). While it is believed that the great preponderance of these SNPs are harmless, a substantial number have been associated with various diseases.
SNPs that occur in the protein coding regions of genes and give rise to the expression of variant or defective proteins are potentially the cause of a genetic-based disease. Even SNPs that occur in non-coding regions can result in altered mRNA and/or protein expression. Examples are SNPs that cause defective splicing at exon/intron junctions. Exons are the regions in genes that contain three-nucleotide codons that are ultimately translated into the amino acids that form proteins. Introns are regions in genes that can be transcribed into pre-messenger RNA but do not code for amino acids. In the process by which genomic DNA is transcribed into messenger RNA, introns are often spliced out of pre-messenger RNA transcripts to yield messenger RNA.
For example, in the “healthy” form of the protein hemoglobin, the amino acid at the sixth position in the protein's beta chain is glutamic acid. This amino acid is encoded in the hemoglobin gene by the DNA codon guanine-adenine-guanine (GAG). In some individuals, however, the adenine nucleotide in this codon is replaced with the thymine nucleotide, resulting in a GTG codon which codes for the amino acid valine. This substitution of valine for glutamic acid alters the normal shape of the hemoglobin protein. Red blood cells that contain these abnormally shaped hemoglobin proteins exhibit a sickle shape and are unable to perform the oxygen-transport function normally associated with red blood cells. Individuals who are GTG homozygous (i.e., have inherited a GTG variant from each parent) suffer from sickle cell anemia.
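The hemoglobin example can be traced in a few lines of code. This is a minimal sketch: the codon table holds only the two codons named in the text (a full table would have 64 entries), and the helper name `substitute_base` is hypothetical.

```python
# Partial codon table: only the two codons discussed in the text.
CODON_TO_AMINO_ACID = {
    "GAG": "glutamic acid",
    "GTG": "valine",
}


def substitute_base(codon, position, new_base):
    """Return a copy of the codon with the base at `position` (0-based) replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"
# The single nucleotide change: adenine (A) at the middle position becomes thymine (T).
variant_codon = substitute_base(normal_codon, 1, "T")

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(variant_codon, "->", CODON_TO_AMINO_ACID[variant_codon])
```

A single-base change in the middle position is enough to switch the encoded amino acid from glutamic acid to valine, which is the substitution associated with sickle cell anemia in homozygous individuals.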
In addition to sickle cell anemia, SNPs have been associated with diseases such as cystic fibrosis, Huntington's chorea, beta-thalassemia, muscular dystrophy, fibromuscular dysplasia, phenylketonuria, Type II diabetes, a hyperlipidemous disorder associated with Apolipoprotein E2, at least one form of hypertension, and some forms of migraine headaches. These disease-associated SNPs are inherited through classic Mendelian mechanisms. This type of SNP, however, is not believed to be the predominant form of SNPs associated with the most common diseases. This view is supported by the theory that common mutations in the protein coding regions would render protein structures entirely dysfunctional and therefore completely shut down a specific pathway or parts of such pathways, a result which is not supported by observation. Nevertheless, it is believed that functional variants associated with phenotypes further associated with diseases should be clustered around non-coding sites that play an important role in the functioning of the genome.
Examples of such functional, non-coding sites are the “splice sites” at which pre-messenger RNA transcripts are spliced into messenger RNA (mRNA). The need for splicing arises from the fact that the pre-messenger RNA transcripts contain stretches of RNA that correspond to introns in the genomic DNA from which the pre-messenger RNA transcript derives. The complex of proteins and RNA at which splicing occurs is called the “spliceosome”. (See, e.g., Fairbrother et al. 2002).
A few different methods are commonly used to analyze DNA for polymorphisms and genotype. The most definitive method is to sequence the DNA to determine the actual base sequence (see, A. M. Maxam and W. Gilbert, Proc. Natl. Acad. Sci. USA 74:560 (1977); Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463 (1977)). Patent application 20020082869, “Method and system for providing and updating customized health care information based on an individual's genome”, Anderson, Glen J., describes a system for delivering personalized genetic profiling information based on sequencing. Although such a method is the most definitive it is also the most expensive and time-consuming method. Accordingly, the sequencing of the human genome has only been performed for research purposes such as the Human Genome Project on samples from a very small number of individual humans, and at a cost of millions of dollars per individual. While the cost of sequencing the genome of an individual human has been following a steeply declining price/performance curve, where performance is measured in terms of accuracy and time, the substantial cost that still stands today prohibits its use on a broad commercial scale. Until the cost of sequencing technologies declines substantially further, the delivery of genetic profiles to a significantly large number of individuals cannot be cost effectively based on genome sequencing. Moreover, as described below, simply being able to sequence an individual's genome is not sufficient to generate and provide a comprehensive genetic profile product or service to the individual.
Another method of analyzing DNA for polymorphisms and genotype is restriction mapping analysis. With this method genomic DNA is digested with a restriction enzyme and the resulting fragments are analyzed on an electrophoresis gel or with a Southern blot to determine the presence or absence of a polymorphism that changes the recognition site for the restriction enzyme. This method can also be used to determine the presence or absence of gross insertions or deletions in genomic DNA by observing the lengths of the resulting DNA fragments. In this respect, restriction mapping analysis has limited use in the type of genome-wide search for polymorphisms and genotyping analysis required for providing genetic profile products and services of the type contemplated by the present invention.
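The principle behind restriction mapping can be sketched as follows. The example assumes the enzyme EcoRI, which recognizes GAATTC and cuts after the first G; the sequences are toy constructs invented for illustration, only one strand is modeled, and the function names are hypothetical.

```python
def site_positions(sequence, site):
    """Return the start index of every occurrence of the recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, site, cut_offset):
    """Return fragment lengths after cutting `cut_offset` bases into each site."""
    cuts = [p + cut_offset for p in site_positions(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


ECORI_SITE = "GAATTC"  # EcoRI cuts between G and AATTC, so cut_offset is 1

# Toy sequences: the variant carries a single A-to-G change in the second site.
reference = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TTTT"
variant = "AAAA" + "GAATTC" + "CCCCC" + "GAGTTC" + "TTTT"

print(fragment_lengths(reference, ECORI_SITE, 1))  # three fragments
print(fragment_lengths(variant, ECORI_SITE, 1))    # two fragments: one site is lost
```

Because the single nucleotide change destroys one recognition site, the variant yields fewer and longer fragments than the reference, which is the difference an electrophoresis gel would reveal. The sketch also shows the limitation noted above: only polymorphisms that happen to fall within a recognition site are detectable this way.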
Another method of determining the genotype of an individual at a given site is to detect the presence of one or more nucleotide sequences at that site known to be associated with a predisposition, disease or other phenotypic abnormality. These sites, also called “genetic markers,” can be detected using various tagged oligonucleotide hybridization technologies that are significantly less costly than genomic sequencing and allele-specific hybridization. Means now exist for constructing and performing large-scale, multiplexed genetic marker hybridization tests on biological samples from individuals, such as samples of blood, saliva and urine. These means, such as very dense chip and bead arrays, can enable a sample from an individual to be tested simultaneously for the presence of thousands of genetic markers. (Oliphant A et al., Biotechniques Suppl:56-8, 60-1 [2002]; and Fodor S P, Science 251(4995):767-73 [1991]).
Splice junctions in pre-messenger RNA, 5-prime (exon to intron transition) and 3-prime (intron to exon transitions), are the sequence regions that are used as recognition sites for the spliceosome and contain certain sequence motifs which typically are conserved between related species. Nucleotide changes in these binding sites can have a substantial effect on the spliced mRNA product, depending on the effect of the mutation on the overall binding affinity of the spliceosome components with the mRNA sequence. Known mis-splicing behavior arises from exon skipping, alternative splicing, protein coding truncation through the introduction of a frame shift, and the disruption of the entire mRNA production process. These changes have significant effects in the mRNA and protein processing step and can totally change their production. In addition, smaller changes can partially regulate and influence quantitatively the splicing behavior of certain genes. Additional sites known to be involved and sometimes even known to regulate splicing, are the branch-point, enhancer and silencer sequences (Fairbrother et al. 2002). Splice sites constitute locations in the genome for evolutionary pressure to function through nucleotide mutations.
Similarly, promoter regions in genes constitute locations in the genome where the presence of a SNP can be used for determining an individual's genotype. As gene-expression regulatory mechanisms, promoter regions include the transcription start site and various transcription factor-binding sites, including all the regions that are involved in gene regulation.
The determination of the presence of polymorphisms or, less frequently, mutations, in DNA has become a very important tool for a variety of purposes. Detecting mutations that are known to cause or to predispose persons to disease is one of the more important uses of determining the possible presence of a mutation. One example is the analysis of the gene named BRCA1, which may result in breast cancer if it is mutated (see, Miki et al., Science, 266:66-71, 1994). Several known mutations in the BRCA1 gene have been causally linked with breast cancer. It is now possible to screen women for these known mutations to determine whether they are predisposed to develop breast cancer. Some other uses for determining polymorphisms or mutations are for genotyping and for mutational analysis for positional cloning experiments.
In some cases, as illustrated in the case of the hemoglobin SNP and sickle cell anemia, the association of a SNP with a disease is direct and well-established and can be simply diagnosed. In many other cases, however, the association of a SNP with a phenotype that gives rise to disease or other adverse medical condition is not well-established, and different diseases and disorders have different associations with different genotypes and SNPs. In these cases, the association between genotype and phenotype can vary from individual to individual in a complex manner that depends on the individual's genome, age, family history, life style habits, and other personal health and demographic factors. Consequently, direct testing of the individual's DNA is not accurate 100% of the time in predicting the onset of a genetically-based disease or other adverse medical condition. In these more complex cases, there is a probabilistic relationship between genotype (as characterized by different variants and SNPs) and phenotype (as characterized by the association of phenotypes with different diseases). In these cases, the presence of a SNP at a given genetic site is not sufficient by itself for the development of a pathological condition. In addition, not all persons possessing a given SNP in a given variant will develop a disease associated with that SNP. The onset of a genetically-based disease may also depend on exposure to certain conditions in a person's environment. Moreover, the same disease, disorder, or other adverse medical condition associated with a given SNP in a given variant may result from a different SNP at another site. Consequently, comprehensive analysis of the relationship between an individual's genotype and phenotype requires a scoring matrix of variables along various dimensions and a method of using this matrix to determine the probability that a given genotype in a given individual will result in a given phenotype.
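One way to picture a scoring matrix of this kind is sketched below. The text does not specify how such a matrix is built or combined, so everything here is an assumption made for illustration: the site names, the score values, and the choice to add per-genotype contributions on a log-odds scale to a baseline risk are all hypothetical and carry no clinical meaning.

```python
import math

# Hypothetical scoring matrix: (site, genotype) -> log-odds contribution.
# All values are invented for illustration and are not real data.
LOG_ODDS = {
    ("site_1", "AA"): 0.0,
    ("site_1", "AB"): 0.4,
    ("site_1", "BB"): 1.1,
    ("site_2", "AA"): 0.0,
    ("site_2", "AB"): -0.2,
    ("site_2", "BB"): -0.5,
}


def phenotype_probability(profile, baseline_probability, log_odds=LOG_ODDS):
    """Estimate the probability of a phenotype from a genetic profile.

    `profile` maps site names to unordered genotypes. Contributions are
    assumed to be independent and additive on the log-odds scale; entries
    missing from the matrix contribute nothing.
    """
    logit = math.log(baseline_probability / (1.0 - baseline_probability))
    logit += sum(log_odds.get((site, genotype), 0.0) for site, genotype in profile.items())
    return 1.0 / (1.0 + math.exp(-logit))


profile = {"site_1": "BB", "site_2": "AB"}
print(round(phenotype_probability(profile, baseline_probability=0.10), 3))
```

The output is a probability rather than a yes/no diagnosis, which matches the probabilistic relationship described above. A realistic matrix would also need dimensions for age, family history, lifestyle, and environmental exposure, and would have to account for interactions between sites.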
To further illustrate the complexity of associating genotypes with phenotypes, it is currently believed that the human genome comprises approximately 30,000 genes while the human proteome comprises potentially millions of proteins. The process by which the information contained in the DNA comprising 30,000 distinct genes is transcribed into messenger RNA, which is in turn translated into the sequence of amino acids comprising potentially millions of distinct proteins, therefore adds significant complexity to associations of genotypes with phenotypes. In addition, in the search for unknown disease-causing variants, whole-genome association scans using hundreds of thousands of genetic markers simultaneously are likely to face serious theoretical-statistical challenges, as well as practical difficulties associated with the management of data sets of enormous size and complexity. One obvious problem is the fact that, the more genetic markers are used, the higher the expected number of apparent, spurious associations that are the result of statistical chance as opposed to true association stemming from shared genealogy between genetic marker and causative allele.
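The scale of the spurious-association problem can be made concrete with simple arithmetic. The sketch below assumes that no marker is truly associated and that each marker is tested at the same per-test significance level; the marker count of 500,000 is an illustrative figure. The Bonferroni correction shown is a standard statistical adjustment introduced here for illustration and is not a method described in the text.

```python
def expected_false_positives(num_markers, alpha):
    """Expected number of chance associations when no marker is truly associated."""
    return num_markers * alpha


def bonferroni_threshold(num_markers, family_alpha=0.05):
    """Per-marker significance threshold that bounds the family-wise error rate."""
    return family_alpha / num_markers


num_markers = 500_000  # illustrative size of a whole-genome association scan

print(expected_false_positives(num_markers, alpha=0.05))
print(bonferroni_threshold(num_markers))
```

At a conventional per-test level of 0.05, a scan of this size would be expected to produce on the order of 25,000 apparent associations by chance alone, while the corrected per-marker threshold becomes very stringent, which in turn demands large sample sizes to detect true associations.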
Beyond this complexity of associating genotypes with phenotypes, there has been rapid growth of data on the existence of SNPs, their locations in the human genome, and associations of SNPs with phenotypes that are further associated with various diseases. This data arises from research on genomics, proteomics, preclinical and clinical studies of pharmaceuticals and related research gathered from laboratories, hospitals and medical clinics around the world.
Genomics has the potential to change the way medicine is practiced and impact the health of individuals. (E.g., Guttmacher A E, Collins F S, Genomic medicine—a primer, N Engl J Med 347: 1512-20 [2002]; Varmus, Getting ready for gene-based medicine, N Engl J Med 347: 1526-7 [2002]). Through sequencing and genotyping, extensive personal genetic information is expected to continue to be generated in large quantities in coming years. (Trager R S, DNA sequencing. Venter's next goal: 1000 human genomes, Science 298: 947 [2002]). The rapid growth in volume of available genomic and proteomic data has been characterized by substantial disorder and information obsolescence. For example, currently, the Gene Ontology (GO) consortium (Ashburner, M et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-9 [2000]) and the National Library of Medicine's MeSH (Schulman J-L, What's New for 2001 MeSH, NLM Tech Bull. 317 [2001]) are two of the best-known ontologies in the bioinformatics domain. Neither of these ontologies, however, currently contains the necessary information to support research about the relationships between genes and disease in the context of the human genome. While GO is well suited to classify a gene product in terms of its function, process and location, it has no terms to describe human diseases and the relations between them, whereas MeSH, while rich in descriptions and classifications of human disease, contains no information about sequences, little information about genes, and no information about disease-causing mutations and SNPs. This is an unfortunate situation, especially in light of the recent completion of the human genome sequence and its annotation.
A key aspect of research in genetics is the association of sequence variation with disease genes and phenotypes. Sequence variation data are currently available, for example, from OMIM, HGMD (Hamosh A, et al., Online Mendelian Inheritance in Man (OMIM), a Knowledge base of human genes and genetic disorders, Nucleic Acids Res 30: 52-5 [2002]; Krawczak, M et al., Human gene mutation database—a biomedical information and research resource, Hum Mutat 15: 45-51 [2000]; McKusick V A, Online Mendelian Inheritance in Man, OMIM (TM), McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins [2000]) and others, both of which provide phenotypic information and describe amino acid variation. Unfortunately, in most cases these variation references do not provide sufficient information to support their direct mapping onto current genomic sequences and the associated annotated genes. Single nucleotide polymorphism (SNP) data are held in dbSNP and other publicly accessible databases. (E.g., Sachidanandam, R et al., A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms, Nature 409: 928-33 [2001]; Sherry, S T et al., dbSNP: the NCBI database of genetic variation, Nucleic Acids Res 29: 308-11 [2001]). While these databases contain millions of entries, each including the position of the SNP on the genome, they do not provide significant phenotypic information about the SNPs at the levels which need to be reached, namely from the genome to the phenotype and the clinic.
Moreover, as the volume of genomic and proteomic data grows there are requirements to synthesize vast amounts of information to enable clearer understanding of an individual's genetic profile. Patent application 20020052761 (“Method and system for genetic screening data collection, analysis, report generation and access”, Fey, Christopher T.; et al.) describes a system for generating highly complex personal health reports to individuals concerning their genetic test results, based on an aggregate set of genetic markers and phenotypes.
Over the years most genomic and clinical advances have been published in scientific journals. Molecular biology advances and findings have been published predominantly in molecular biology journals (e.g., Cell, Nat Genetics, Am J. Hum. Genetics, etc.), and clinical phenotype related findings have been published predominantly in medical journals (e.g., N Eng J Medicine, Lancet, etc.). Because these different journals are directed to different communities, large communication gaps have been created. Thus, there now exist in the public domain two distinct information resources, and neither is as valuable as it potentially could be because current research efforts require their integration. One part includes all large public genomic databases and the other is the vast amount of clinical research data, mostly held in publications, but increasingly accessible electronically. There is a clear tendency in the community toward computer-based classification of disease through ontologies and toward relating medical diagnostic classification schemes such as the ICD-9 with gene diseases (e.g., the NuGene project; Chisholm, R et al., The Nugene Project [2003]).
There are currently various standardization efforts occurring within molecular biology, most notably the gene ontology (GO) consortium efforts. (Ashburner, M et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-9 [2000]). Additional ontologies such as the sequence ontology, the mutation ontology and others are works in progress (for an overview of ontology development please see the “global open biological ontologies” (GOBO) web site (www.geneontology.org/doc/gobo.html)).
Databases containing information on polymorphisms are also expected to have an important impact in the field of pharmacogenomics. Pharmacogenomics is an area of research focused on how variations in a patient's DNA can cause patients to respond differently to pharmaceuticals. The importance of understanding these variations is underscored by the number of hospitalizations and deaths that occur each year that are caused by adverse drug reactions. One method of characterizing the genetic basis of drug response is by cataloging variations in drug response as a function of SNPs. The more SNPs cataloged, the more robust and effective the database. However, collecting and sorting the SNPs becomes a huge undertaking. In U.S. Patent application 20020049772, Reinhoff et al. provide a broad overview of polymorphisms, pharmacogenomics, and classifying populations based upon sets of polymorphisms.
In addition to the scientific, technological and medical complexities that characterize the development and commercialization of genetic profile products and services, there are growing legal and regulatory complexities. For example, patient privacy has been a growing concern in multiple jurisdictions. In Europe, the European Union Directive 95/46/EC is designed to protect individuals with regard to the processing and movement of their personal data. In the United States, under the Health Insurance Portability and Accountability Act of 1996, commonly referred to as “HIPAA”, regulations have been adopted that set forth “Standards for Privacy of Individually Identifiable Health Information”. The purpose of these regulations is to help guarantee privacy and confidentiality of patient medical records. These Standards are quite extensive and apply to health care providers, insurers, payors and employers.
The confluence of all of the factors discussed above leads to the conclusion that what has been lacking from the art, but necessary for viable broad-based commercial provision of personalized health management products and services based on genetic profiling, is a method that satisfies the following requisites:
    (1) the genotype of the individual to whom such products and services are being provided must be accurately and economically determinable at a large number of sites in that individual's genome relevant to a broad selection of different diseases;
    (2) a large, dynamic, well-curated database containing the associations between diverse genotypes and phenotypes must be maintained, easily accessed, and updated at very frequent intervals;
    (3) for each individual to whom such products and services are being provided, the individual's genotype at each such site must be easily analyzed and filtered through such database to determine the individual's phenotype and construct the individual's genetic profile; and
    (4) the genetic profile so constructed for each individual and its implications must be easily communicated to the individual and the individual's physicians and medical/health care counselors in an effective manner that complies with health care, privacy and other laws and regulations.
Various means exist for practicing each of these separately. Each such means, however, suffers from various deficiencies, and a method of collectively optimizing their combined practice is required in order to provide health care management products and services on a broad commercial scale at prices that are economically attractive to both provider and customer. The present invention provides these and other benefits.