Precision medicine, genetic testing, therapeutic development and whole genome, exome, gene panel and mini-gene reporter analysis require the ability to accurately interpret how diverse features encoded in the genome, such as protein binding sites, RNA secondary structures, and nucleosome positions, impact processes within cells. Most existing approaches to identifying disease variants ignore their impact on these genomic features. Many genome studies are restricted to mutations in exons that either change an amino acid in a protein or prevent the production of the protein.
Over the past decade, several observations have underscored the importance of understanding regulatory genomic instructions, and not just the protein-coding exons and genes that they control. First, while evolution is estimated to preserve at least 5.5% of the human genome, only 1% is accounted for by exons within genes. Second, biological complexity often cannot be accounted for by the number of genes (e.g. balsam poplar trees have twice as many genes as humans). Third, differences between organisms cannot be accounted for by differences between their genes (e.g. fewer than 1% of human genes are distinct from those of mice and dogs). Finally, disease-causing variants have increasingly been found outside of exons, indicating that crucial information is encoded outside of those sequences.
In traditional molecular diagnostics, an example workflow may be as follows: a blood or tissue sample is obtained from a patient; variants (mutations) are identified by sequencing the genome, the exome, or a gene panel; the variants are individually examined manually (e.g. by a technician), using literature databases and internet search engines; and a diagnostic report is prepared. Manually examining the variants is costly and prone to human error, which may lead to incorrect diagnosis and potential patient morbidity. Automating or semi-automating this step is thus beneficial. Because the number of possible genetic variants is large, manual evaluation is time-consuming and highly dependent on previous literature. It also relies on experimental data that has poor coverage, which can lead to high false negative rates or to “variants of unknown significance”. The same issues arise in therapeutic design, where the number of possible therapies (molecules) to be evaluated is extremely large.
Techniques have been proposed in which predicting phenotypes (e.g., traits and disease risks) from the genome is characterized as a problem suitable for solution by machine learning, and more specifically by supervised machine learning in which the inputs are features extracted from a DNA sequence (genotype) and the outputs are the phenotypes. Such an approach is shown in FIG. 2(a). A DNA sequence 204 is fed to a predictor 202 to generate outputs 208, such as disease risks. This approach is unsatisfactory for most complex phenotypes and diseases for two reasons. The first is the sheer complexity of the relationship between genotype (represented by 204) and phenotype (represented by 208). Even within a single cell, the genome directs the state of the cell through many layers of intricate biophysical processes and control mechanisms that have been shaped by evolution. It is extremely challenging to infer these regulatory processes by observing only the genome and phenotypes, for example due to ‘butterfly effects’. For many diseases, the amount of data necessary would be cost-prohibitive to acquire with currently available technologies, due to the size of the genome and the exponentially large number of possible ways a disease can be traced to it. The second is that, even if one could infer such models (those that are predictive of disease risks), it is likely that the hidden variables of these models would not correspond to biological mechanisms that can be acted upon, unless strong priors, such as cause-effect relationships, have been built in. This correspondence is important for the purpose of developing therapies. However, using such priors to dictate how a model ought to work can hurt model performance if the priors are inaccurate, which they usually are.
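By way of illustration only, the direct genotype-to-phenotype formulation of FIG. 2(a) can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any cited technique: the one-hot encoding, the logistic form of the predictor, the function names `one_hot_encode` and `predict_risk`, the example sequence, and the untrained random weights are all hypothetical and chosen only to show a DNA sequence (204) passing through a predictor (202) to yield an output (208).

```python
import numpy as np

BASES = "ACGT"

def one_hot_encode(sequence):
    """Encode a DNA string as a flat vector with four indicator entries per base.

    Assumes the sequence contains only A, C, G and T; any other symbol
    raises ValueError.
    """
    encoded = np.zeros((len(sequence), len(BASES)))
    for i, base in enumerate(sequence.upper()):
        encoded[i, BASES.index(base)] = 1.0
    return encoded.ravel()

def predict_risk(sequence, weights, bias):
    """Map sequence features directly to a score in (0, 1) with a logistic model."""
    features = one_hot_encode(sequence)
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

rng = np.random.default_rng(0)
sequence = "ACGTTGCA"  # hypothetical input sequence (item 204)
# Untrained placeholder weights; a real predictor (item 202) would be fitted
# to labelled genotype/phenotype pairs.
weights = rng.normal(size=len(sequence) * len(BASES))
risk = predict_risk(sequence, weights, bias=0.0)  # item 208
```

In this sketch the number of weights grows with the length of the sequence, which gives a sense of why the amount of labelled data needed to fit such a predictor directly can become prohibitive at genome scale.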
Some other machine learning approaches to genetic analysis have been proposed. One such approach predicts a cell variable that combines information across conditions, or tissues. Another describes a shallow, single-layer Bayesian neural network (BNN). BNNs often rely on methods such as Markov Chain Monte Carlo (MCMC) to sample models from a posterior distribution, and these methods can be difficult to speed up and to scale up to a large number of hidden variables and a large volume of training data. Furthermore, it is relatively expensive, computationally, to obtain predictions from a BNN, because doing so requires computing the average predictions of many models.
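The prediction cost noted above can be illustrated with a short sketch. It assumes a network with one hidden layer and a single sigmoid output, and it uses randomly drawn weights as stand-ins for posterior samples; no MCMC sampler is implemented here. The function names, layer sizes, and number of sampled models are hypothetical.

```python
import numpy as np

def predict_single(w_hidden, w_out, x):
    """Forward pass of one sampled model: tanh hidden layer, sigmoid output."""
    hidden = np.tanh(x @ w_hidden)
    return 1.0 / (1.0 + np.exp(-(hidden @ w_out)))

def predict_bnn(w_hidden_samples, w_out_samples, x):
    """Average the predictions of every sampled model (a Monte Carlo estimate)."""
    preds = [
        predict_single(w_hidden, w_out, x)
        for w_hidden, w_out in zip(w_hidden_samples, w_out_samples)
    ]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(0)
n_features, n_hidden, n_models = 8, 4, 500
# Stand-ins for posterior samples; a BNN as described would draw these
# with a method such as MCMC rather than from a fixed normal distribution.
w_hidden_samples = rng.normal(size=(n_models, n_features, n_hidden))
w_out_samples = rng.normal(size=(n_models, n_hidden))
x = rng.normal(size=(3, n_features))  # three hypothetical feature vectors

averaged = predict_bnn(w_hidden_samples, w_out_samples, x)
```

Each call to `predict_bnn` performs one forward pass per sampled model, so the cost of a prediction grows linearly with the number of models averaged, in contrast to the single forward pass of a network with fixed weights.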