Many large data sets are difficult to work with and challenging to analyze successfully. In particular, it is often difficult to quickly identify relevant data or associations of interest within a large data set. This problem arises in a number of fields.
For example, as genome-wide association studies (GWAS) and whole genome sequence (WGS) analyses of complex human disease determinants expand, it is increasingly necessary to develop strategies that facilitate the sharing of large data sets and the rapid replication and validation of potential disease-gene associations. This is especially true for signals that straddle the threshold for genome-wide significance due to small effect size, lack of statistical power in the study population, or a combination of these.
Annotations of human genome variation have identified some 60 million single nucleotide polymorphisms (SNPs), which offer the promise of connecting nucleotide and structural variation to hereditary traits. Genotyping arrays that resolve millions of common SNPs have enabled over 2000 GWAS to discover principal genetic determinants of complex multifactorial human diseases. Today, whole genome sequence association has extended the prospects for personalized genomic medicine by capturing rare variants, copy number variation, indels, and epistatic and epigenetic interactions, in hopes of achieving individualized genomic assessment, diagnostics, and therapy of complex maladies through interpretation of an individual's genomic heritage.
GWAS to date have produced conflicting signals because many SNP associations fail to replicate in independent studies. Further, GWAS frequently fail to implicate previously validated gene regions described in candidate gene associations for the same disease, and in most cases account for less than 10% of the explanatory variance for the disease etiology. In addition, discovered gene variants are frequently nested in noncoding desert regions of the genome that are difficult to interpret. At least some of these weaknesses derive from discounting SNP association “hits” that fail to achieve “genome-wide significance”, a widely accepted, albeit conservative, statistical threshold set to discard the plethora of seemingly false positive statistical associations (Type I errors) that derive from the large number of SNPs interrogated.
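As an illustration of why such a threshold is conservative, the following is a minimal sketch of a Bonferroni-style cutoff, assuming a family-wise error rate of 0.05 spread over roughly one million independent tests, which yields the commonly cited cutoff of about 5e-8. The function names, the "suggestive" tier of 1e-5, the SNP labels, and the p-values are hypothetical and chosen only for illustration; they are not taken from any particular study or from the methods described here.

```python
from typing import Dict


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that holds the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def classify_hits(
    p_values: Dict[str, float],
    alpha: float = 0.05,          # assumed family-wise error rate
    n_tests: int = 1_000_000,     # assumed number of independent tests
    suggestive: float = 1e-5,     # assumed (hypothetical) sub-threshold tier
) -> Dict[str, str]:
    """Label each SNP by where its p-value falls relative to the cutoffs."""
    cutoff = bonferroni_threshold(alpha, n_tests)
    labels = {}
    for snp, p in p_values.items():
        if p <= cutoff:
            labels[snp] = "significant"
        elif p <= suggestive:
            # Below genome-wide significance: may be a true weak signal
            # or a statistical artifact; the cutoff alone cannot tell.
            labels[snp] = "suggestive"
        else:
            labels[snp] = "discarded"
    return labels


# Illustrative, made-up p-values for three hypothetical SNPs.
example = {"snp_a": 3e-9, "snp_b": 2e-7, "snp_c": 0.01}
labels = classify_hits(example)
```

In this sketch, an association with a p-value such as 2e-7 is not labeled significant even though it would be striking in a single-test setting, which is the kind of sub-threshold "hit" that tends to be discounted.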
A challenge in genetic epidemiology involves disentangling the true functional associations that fall below the genome-wide significance threshold from the myriad statistical artifacts that also occur. No one has developed a real solution to this conundrum, though some approaches have been offered. Many researchers agree that more widely practiced open access sharing of unabridged GWAS data would allow multiple plausible approaches to be brought to bear on this question. However, for many cohorts, especially those developed before the advent of the genomics era, participants did not give consent for open access to genome-wide data. Since patient anonymization is virtually impossible with genetic epidemiological data, the prospect of sharing patients' genotype and clinical data may conflict with ethical concerns over protecting the individual privacy of study subjects.
Needs exist for improved systems and methods for visualization, sharing and analysis of large data sets.