Monogenic diseases are frequent causes of neonatal morbidity and mortality, and disease presentations are often undifferentiated at birth. More than 3,500 monogenic diseases have been characterized, but clinical testing is available for only some of them and many feature clinical and genetic heterogeneity. As such, an immense unmet need exists for improved molecular diagnosis in infants. Because disease progression is extremely rapid, albeit heterogeneous, in newborns, molecular diagnoses must occur quickly to be relevant for clinical decision-making.
Neonatal intensive care units (NICUs) are especially suitable for early adoption of diagnostic whole genome shotgun analysis (WGS), because many of the 3,528 monogenic diseases of known cause present during the first 28 days of life. In the United States, more than 20% of infant deaths are caused by congenital malformations, deformations, and chromosomal abnormalities that cause genetic diseases. Although this proportion has remained unchanged for the past 20 years, the prevalence of monogenic diseases in NICUs is not known because ascertainment rates are low. Serial gene sequencing is too slow to be clinically useful for NICU diagnosis. Newborn screens, while rapid, identify only a few genetic disorders for which inexpensive tests and cost-effective treatments exist. Further complicating diagnosis is the fact that the full clinical phenotype may not be manifest in newborn infants (neonates), and genetic heterogeneity can be immense. Thus, acutely ill neonates with genetic diseases are often discharged or deceased before a diagnosis is made. As a result, NICU treatment of genetic diseases is usually empirical and can lack efficacy, be inappropriate, or cause adverse effects.
NICUs are also suitable for early adoption of genomic medicine because extraordinary interventional efforts are customary and innovation is encouraged. Indeed, NICU treatment is among the most cost-effective of high-cost health care, and the long-term outcomes of most NICU subpopulations are excellent. In genetic diseases for which treatments exist, rapid diagnosis is critical for timely delivery of neonatal interventions that lessen morbidity and mortality. For neonatal genetic diseases without effective therapeutic interventions, of which there are many, timely diagnosis avoids futile intensive care and is critical for research to develop management guidelines that optimize outcomes. In addition to influencing treatment, neonatal diagnosis of genetic disorders and genetic counseling can spare parents diagnostic odysseys that instill inappropriate hope or perpetuate needless guilt.
Two recent reports exemplify the diagnostic and therapeutic uses of next generation sequencing (NGS) in the context of childhood genetic diseases. WGS of fraternal twins concordant for dopa-responsive dystonia revealed known mutations in the sepiapterin reductase (SPR) gene. In contrast to other forms of dystonia, treatment with 5-hydroxytryptamine and serotonin reuptake inhibitors is beneficial in patients with SPR defects. Application of this therapy in appropriate cases resulted in clinical improvement. Likewise, extensive testing failed to provide a molecular diagnosis for a child with fulminant pancolitis (extensive inflammation of the colon), in whom standard treatments for presumed Crohn's disease, an inflammatory bowel disease, were ineffective. NGS of the patient's exome, together with confirmatory studies, revealed X-linked inhibitor of apoptosis (XIAP) deficiency. This diagnosis had not been entertained by the treating physicians because XIAP mutations had not previously been associated with colitis. Hemopoietic progenitor cell transplant was performed, as indicated for XIAP deficiency, with complete resolution of colitis. Lastly, for about 3700 genetic illnesses for which a molecular basis has not yet been established, WGS can suggest candidate genes for functional and inheritance-based confirmatory research.
The current cost of research-grade WGS is $7,666, which is similar to the current cost of commercial diagnostic dideoxy sequencing of two or three disease genes. Within the context of the average cost per day and per stay in a NICU in the United States, WGS in carefully selected cases is acceptable and even potentially cost-saving. However, the turnaround time for interpreted WGS results, like that of dideoxy sequencing, is typically about 4 to 6 weeks, which is too slow to be of practical utility for NICU diagnoses or clinical guidance. The method of the present invention provides WGS and largely automated bioinformatic analysis of suspected genetic disorders within 50 hours or 24 hours, which is a time frame that is unique to the system disclosed herein. This is also a promising time frame for emergency use in level III and IV NICUs.
As the cost and turnaround time of WGS continue to decrease rapidly, there will be increasing opportunity to utilize WGS to guide the diagnosis, prognosis and treatment (pharmacogenomics) of inherited diseases. However, there are more than 7,500 genetic diseases, of which more than 3,500 monogenic diseases have been characterized at the gene level. Many of these diseases are rare or extremely rare. Further, many of these diseases feature extensive clinical and genetic heterogeneity. Clinical heterogeneity refers to the phenomenon that a genetic disease, caused by defects in a single gene, can be associated with different phenotypes in different affected individuals. Genetic heterogeneity refers to the phenomenon that a genetic disease, associated with a distinct clinical phenotype, can result from mutations in multiple different genes. Together these characteristics make it exceptionally difficult for a physician to know which genetic disease might be causative in an individual patient. Most physicians will never have learned about most genetic diseases. Thus, while it is feasible to decode the entire genome in individual patients, there is no general method whereby a physician can comprehensively select the diseases or genes that should be examined for causality in that patient. Depending upon the level of specialization in clinical genetics, a physician can be familiar with as many as one hundred diseases that correspond to a particular presentation in an individual patient, or with only one or two.
In practice, however, and as described below, there can be hundreds of diseases that match clinical presentations. Thus, in practice, the diagnosis of most genetic disorders is limited to only approximately 2,000 clinical geneticists in the United States. A diagnosis typically requires referral from a family physician or pediatrician to a subspecialist and then to a clinical geneticist in order for the appropriate tests to be ordered. To allow broad or optimal physician ordering of genome analysis that is relevant to individual patients, the system of the present invention maps individual abnormal signs, symptoms and laboratory values to the genetic diseases that can feature them and to the genes associated with those diseases. Such a system enables generalist physicians to order specific genomic regions to be interrogated, analyzed and interpreted in a manner that is tailored precisely to the presentations in individual patients. Additionally, such a system greatly facilitates the interpretation of which genetic variations are likely to be causative in individual patients. As described below, an average genome sequence contains about 4.1 million genetic variations. An average exome (the sum of all of the exons of protein coding genes in the genome) harbors about 150,000 genetic variations. The system of the present invention for comprehensive mapping of clinical features to potentially causative genes allows, as described below, the analysis and interpretation of those variants to be limited by a factor of 100-fold to 10,000-fold. This greatly decreases the time and effort required to interpret genetic variations that are clinically relevant in WGS. Currently, this time and effort is a substantial impediment to broad use of WGS in disease diagnosis, prognosis and tailored treatment decisions (pharmacogenomics).
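The mapping from clinical features to diseases to genes, and the resulting restriction of the variants that need interpretation, can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the dictionaries `FEATURE_TO_DISEASES` and `DISEASE_TO_GENES`, the `Variant` record, and the function names are hypothetical, and the few entries shown are toy examples rather than curated clinical associations.

```python
from dataclasses import dataclass

# Hypothetical toy mappings for illustration only; a real system would draw
# on a curated, much larger many-to-many knowledge base.
FEATURE_TO_DISEASES = {
    "muscle weakness": {"spinal muscular atrophy", "Duchenne muscular dystrophy"},
    "prolonged bleeding": {"hemophilia A"},
}
DISEASE_TO_GENES = {
    "spinal muscular atrophy": {"SMN1"},
    "Duchenne muscular dystrophy": {"DMD"},
    "hemophilia A": {"F8"},
}


@dataclass(frozen=True)
class Variant:
    """A called variant, reduced to the fields this sketch needs."""
    gene: str
    position: int
    ref: str
    alt: str


def candidate_genes(features):
    """Return every gene linked to any disease that matches any entered feature."""
    genes = set()
    for feature in features:
        for disease in FEATURE_TO_DISEASES.get(feature, set()):
            genes |= DISEASE_TO_GENES.get(disease, set())
    return genes


def restrict_variants(variants, features):
    """Keep only the variants that fall in genes implicated by the features."""
    genes = candidate_genes(features)
    return [v for v in variants if v.gene in genes]
```

Applied to the figures quoted above, a 100-fold to 10,000-fold restriction of about 4.1 million variants in a genome would leave roughly 41,000 to 410 variants for interpretation.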
Physicians can use a web-based portal that provides access to the system, or that is communicably coupled to the system, to enter the symptoms and other patient information. After the information is processed by the system, the system can display the results and/or a list of possible diseases in the web-based portal.
Currently available mappings of associations of clinical features to diseases to genes to mutations have numerous weaknesses. Firstly, these associations are not probabilistic (weighted by the likelihood of truth (or falsehood) of each association). Secondly, these associations are incomplete (they are a sparse subset of all associations of clinical features, genes, diseases and nucleotide variants, given incomplete knowledge or curation of that knowledge). Thirdly, these associations are not weighted in terms of determinism or causality. Fourthly, the associations contain errors. Fifthly, the associations are largely one-to-one associations as opposed to many-to-many associations. Sixthly, the associations are largely unintegrated and unconsolidated, meaning that clinical feature to disease associations are largely not integrated with, for example, gene to mutation mappings. Seventhly, they do not all use standardized vocabularies. Physicians currently attempt to reach a timely singular diagnosis in an individual patient on the basis of their accumulated knowledge and experience, assisted by one or more diagnostic tests (which are usually single categorical or continuous results), nomograms, clinical severity scores, and lists or applications (apps) of subsets of the known associations. However, these methods are largely subjective and incomplete, and they are subject to human limitations of intelligence, memory, ascertainment, bias and experience. This is especially true for genetic diseases, many of which are uncommon and of which there are more than 7,360. Diagnostic testing is typically performed as a series of steps, rather than as multiplexed testing in parallel. As a result, time to diagnosis is often excessive (months to years) and cost of diagnosis is correspondingly high.
This is especially true for genetic diseases, for many of which molecular tests are not available, and for which existing tests are expensive and have a long time-to-result (typically 12 weeks). Furthermore, the time available for a typical physician-patient encounter, in which clinical features are ascertained, is very brief. Delays in definitive disease diagnosis result in interim empirical therapies for patients, and in disease progression and poor outcomes should those therapies be inappropriate or not the current optimal regimen for treatment of the specific underlying disease.
The ability to determine genome sequences of individual patients has created a new and strong need to integrate the millions of variants in a given patient with their clinical picture in order to arrive at a molecular diagnosis of disease. Conventional approaches to the interpretation of genomic variant information have focused first (ad hoc) on the interpretation of genomic variant information, followed by (post hoc) the fitting of that interpreted genomic variant information to the patient's clinical features. These conventional methods examine the potential disease significance of all variants in a genome, greatly slowing analysis. These methods suffer from the drawback that many genomic variants are of unknown functional significance (VUS) and therefore do not point toward or disclose a disease diagnosis. This results in under-interpretation of VUS with regard to disease causality. These methods also suffer from the drawback that the analysis of variant relevance to disease is largely undertaken without estimation of the prior probability of a given disease entity in that patient. In other words, variant interpretation is largely performed without knowledge of the clinical features present, or without a weighting of the probability of each potential genetic disease by the clinical features present in a given patient. In addition, variant interpretation is often performed without analysis of the allele frequency of variants. These shortcomings lead to both under- and over-interpretation of the causality of genetic variants in particular patients. Likewise, conventional approaches to genetic disease diagnosis based on a single-gene-at-a-time approach result in the over-interpretation of the causality of genetic variants in that single gene in particular patients.
Physicians will greatly benefit from assistance by computational systems that possess the capability to apply all known mappings to the overwhelming complexity of human phenotypes, human diseases and human nucleotide variations, for use in the most efficient diagnosis and treatment of individual patients within a probabilistic framework. This is becoming profoundly relevant as whole genome sequencing becomes applied to human disease diagnosis. Given current incomplete, erroneous or partially complete reference associations, it would be beneficial to have a system that is not only comprehensive but is also continuously updated by data sets drawn from each patient experience. In addition, it would be beneficial to have a system that uses some form of logical operators and training sets to assist the system in making probabilistic associations, as opposed to binary associations.
In addition, currently no method exists that can identify many or most disease causing mutations that involve large nucleotide inversions, large deletions, insertions, large triplet repeat expansions, gene conversions and complex rearrangements. For each known mutation a specific assay must be developed. One example of a known disease-causing mutation that involves a large inversion, that is clinically important and very difficult to diagnose by comprehensive sequencing methods, such as whole genome sequencing, is the intron 22 inversion (Inv22), in the factor VIII gene (F8), a duplicon-mediated rearrangement, found in about one half of patients with severe hemophilia A worldwide.
Another example of a common disease not readily diagnosed by comprehensive sequencing is Duchenne or Becker muscular dystrophy, for which ⅔ of cases are due to large deletions in the DMD gene. As such, the first line of current testing for DMD is gene-specific deletion/duplication testing to identify mutations not readily detectable by sequence analysis. This can be performed by a variety of gene-specific methods: quantitative PCR, long-range PCR, multiplex ligation-dependent probe amplification (MLPA), and DMD-specific chromosomal microarray (CMA). However, these methods are inextensible to whole genome analysis, and DMD testing is not possible by comprehensive genomic approaches. An example of a gene for which the analysis is complicated by both large deletions and gene conversions is SMN1. Deletions in the SMN1 gene cause spinal muscular atrophy (SMA), the most common genetic cause of infant death. 95%-98% of individuals with a clinical diagnosis of SMA are homozygous for a deletion of SMN1. 2%-5% of individuals with a clinical diagnosis of SMA are compound heterozygotes for deletion of at least SMN1 exon 7 and an intragenic inactivating mutation of SMN1. Thus, SMN1 cannot be tested for disease causality by comprehensive genomic approaches at present.
Currently, the scientific community is focused on the use of third generation DNA sequencing technologies and haplotyping using second and third generation sequencing technologies for various purposes, but not for the individual diagnosis of a specific genetic disorder in a particular patient with clinical features that suggest a gene defect for which many or most cases of disease-causing mutations involve large nucleotide inversions, large deletions, insertions, large triplet repeat expansions, gene conversions and complex rearrangements. Rather, these sequencing technologies are being used for genome-wide analyses to detect and analyze all genomic events of these types. Therefore, it would be beneficial to have a system that is a general method for molecular diagnosis of previously described disease-causing mutations that involve large nucleotide inversions, large deletions, insertions, large triplet repeat expansions, gene conversions and complex rearrangements that are clinically important and currently undetectable by next-generation sequencing. The use of third generation DNA sequencing technologies, or a combination of second and third generation sequencing technologies, or second generation sequencing technologies with haplotype estimation (either by assembly or imputation) would allow comprehensive genomic analysis for most genetic diseases to include examination of large nucleotide inversions, large deletions, insertions, large triplet repeat expansions, gene conversions and complex rearrangements that are clinically important and currently undetectable by next-generation sequencing. An alternative approach is to supplement a comprehensive genome sequencing technology with whole transcriptome (RNA) sequencing. RNA sequencing allows the quantity of expression of each gene to be determined. 
Thus, RNA sequencing can allow indirect detection of the effects of large nucleotide inversions, large deletions, insertions, large triplet repeat expansions, gene conversions and complex rearrangements that are clinically important and currently undetectable by next-generation sequencing, by examination of the effect on transcription of that gene or genes. Furthermore, RNA sequencing, when performed together with DNA sequencing, allows examination of allelic expression bias. Allelic expression bias is exquisitely sensitive for detection of the effects of genomic variations that act in cis. Allelic expression bias can only be assessed at expressed genomic locations that contain a heterozygous variant. At a heterozygous site, the proportion of expression from each allele (or chromosome) should be 50%. Deviations from 50% indicate that the variant, or a linked variant, is changing the expression of that locus. Thus, RNA sequencing with calculation of allelic expression bias can allow indirect detection of the effects of large nucleotide inversions, large deletions, insertions, large triplet repeat expansions, gene conversions and complex rearrangements that are clinically important and currently undetectable by next-generation sequencing, by examination of the effect on transcription of that gene or genes on maternally and paternally derived chromosomes.
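The calculation of allelic expression bias at a single heterozygous site can be sketched as follows. This is a minimal illustration and not the disclosed method: it assumes that RNA read counts for the reference and alternate alleles are already available, and it uses an exact two-sided binomial test against the expected 50% proportion as one plausible way to judge a deviation. The function names are hypothetical.

```python
from math import exp, lgamma, log


def allelic_fraction(ref_reads, alt_reads):
    """Fraction of RNA reads at a heterozygous site that carry the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no RNA reads cover this heterozygous site")
    return ref_reads / total


def _binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed in log space."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def allelic_bias_p_value(ref_reads, alt_reads, expected=0.5):
    """Exact two-sided binomial p-value for deviation from the expected proportion.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one. Assumes 0 < expected < 1.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no RNA reads cover this heterozygous site")
    probs = [_binomial_pmf(k, n, expected) for k in range(n + 1)]
    observed = probs[ref_reads]
    # Small relative tolerance so that symmetric outcomes are treated as ties.
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-7))
    return min(1.0, p_value)
```

A balanced site such as 5 reference and 5 alternate reads gives a fraction of 0.5 and a p-value of about 1, whereas 10 reference and 0 alternate reads gives a fraction of 1.0 and a p-value of 2/1024. In practice the read-depth threshold and the significance cutoff would need to be chosen, and many sites tested, so some correction for multiple testing would also be required.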
The systems and methods disclosed by the present invention, as described further herein, differ from current approaches to the interpretation of genomic variant information, which have focused first (ad hoc) on the interpretation of genomic variant information, followed by (post hoc) fitting of that interpreted genomic variant information to the patient's clinical features. Instead, these systems described herein utilize the clinical features present in a given patient to define the likelihood of each genetic disease in that patient prior to the interpretation of genomic variant information. This has the advantage that the many genomic variants of unknown functional significance (VUS) can be ranked by likelihood of disease in that patient, and therefore with appropriate weighting for interpretation with regard to disease causality. If variants are analyzed first in the absence of patient information, VUS are likely to be underinterpreted with regard to causality. It also benefits from the advantage that the analysis of variant relevance to disease is largely undertaken in the setting of a calculation of the prior probability of a given disease entity in that patient. This probability is informed by many sources, such as the disease frequency, frequency in various populations, effects of consanguinity, as well as the clinical features in a given patient. Variant interpretation is performed in reference to weighted probabilities of each potential genetic disease based on the clinical features present in a given patient. This leads to tailored appropriate interpretation of the causality of genetic variants in the context of particular patients as well as that of incidental findings.
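The ordering described above, in which clinical features first set the probability of each disease and variants are then weighted accordingly, can be sketched in Python. The text does not specify a formula, so this sketch assumes a simple naive Bayes style calculation: a prior from disease frequency multiplied by the likelihood of each observed feature given the disease, then normalized. All names, the `default_likelihood` parameter, and the input structures are hypothetical.

```python
def disease_posteriors(prevalence, feature_likelihood, observed_features,
                       default_likelihood=0.01):
    """Weight each disease by its prior and by the features seen in the patient.

    prevalence: disease -> prior probability (for example, disease frequency).
    feature_likelihood: disease -> {feature: P(feature | disease)}.
    Features not listed for a disease get default_likelihood (an assumption).
    """
    scores = {}
    for disease, prior in prevalence.items():
        score = prior
        for feature in observed_features:
            score *= feature_likelihood.get(disease, {}).get(feature, default_likelihood)
        scores[disease] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no disease has a non-zero score for these features")
    return {disease: score / total for disease, score in scores.items()}


def rank_variants(variants, gene_to_diseases, posteriors):
    """Order (variant_id, gene) pairs by the most probable disease linked to the gene.

    Variants in genes with no linked disease get weight 0 and sort last.
    """
    def weight(variant):
        _, gene = variant
        linked = gene_to_diseases.get(gene, ())
        return max((posteriors.get(d, 0.0) for d in linked), default=0.0)

    return sorted(variants, key=weight, reverse=True)
```

With this arrangement, a variant of unknown significance in a gene tied to a disease that fits the patient's presentation rises in the ranking, while the same class of variant in an unrelated gene falls. Other inputs named in the text, such as population-specific frequency and consanguinity, could adjust the prior, but are omitted here.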