This invention relates generally to bioinformatics, and more particularly to improved methods for the rapid and efficient analysis of genetic variants in genomic data using a distributed parallel framework.
Analysis of DNA sequence data is becoming common in research and diagnostics. A single-nucleotide polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide (A, C, G or T) at a particular location in the genome differs between individuals, e.g., humans. The variation may also be a blank (which indicates a deletion) or may span multiple loci, e.g., AC, AG, etc. Variations in the DNA sequences of humans (which we generally refer to as SNPs, although the term can be extended to other variants, including structural variants) can affect how individuals develop diseases and respond to pathogens, chemicals, drugs and other agents. SNPs are important for personalized medicine and are used in genome-wide association studies related to diseases and normal traits.
Processing of genetic data to identify associations, relations and possible anomalies is generally a resource-intensive and time-consuming process. There are about three billion base pairs in human DNA, some number of which may contain a variant, and generally it is necessary to look at each SNP in turn and compare it to other SNPs and to other samples to determine the probability of the variant being related to a particular disease or trait. There are many different ways in which genetic data can be represented and stored. For example, an individual's genetic data could be stored as a row in a flat file. The flat file may contain the genotypes in a predefined order for genetic markers. Each flat file may contain multiple rows of data for multiple SNPs for each individual. Other covariate or phenotype information related to an individual could be stored in the same file or in other files, with identifiers that allow the genetic and other covariate data of an individual to be linked. In this version of flat file design, the SNPs must be in the same order for each individual. Another common and intuitive storage format is the variant call format (VCF) or VCF-like formats. In these formats, each row contains data on a single SNP or variant. SNP data for multiple individuals may also be included on each line or row for a particular variant. In this document, VCF will refer to both strictly VCF and VCF-like formats. These formats may also contain additional information about SNPs, including quality scores and filters, and can contain further rich information. In general, all the aforementioned flat files store SNP data in what we refer to as row-wise storage.
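The two row-wise layouts described above can be illustrated with a minimal sketch in Python. The sample data, the column names, and the function names `parse_individual_rows` and `parse_variant_rows` are hypothetical and chosen only for illustration; the variant-per-row sample is a simplified VCF-like layout, not a file conforming to the VCF specification.

```python
import csv
import io

# Hypothetical layout 1: one row per individual, SNPs in a fixed column order.
INDIVIDUAL_ROWS = (
    "id,rs1,rs2\n"
    "ind1,AA,AG\n"
    "ind2,AG,GG\n"
)

# Hypothetical layout 2 (simplified, VCF-like): one row per variant,
# with one genotype column per individual after the fixed columns.
VARIANT_ROWS = (
    "ID\tREF\tALT\tind1\tind2\n"
    "rs1\tA\tG\tAA\tAG\n"
    "rs2\tA\tG\tAG\tGG\n"
)


def parse_individual_rows(text):
    """Return {individual: {snp: genotype}} from the per-individual layout."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        individual = row.pop("id")
        table[individual] = dict(row)
    return table


def parse_variant_rows(text, fixed_columns=("ID", "REF", "ALT")):
    """Return {snp: {individual: genotype}} from the per-variant layout."""
    table = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        table[row["ID"]] = {
            name: value for name, value in row.items()
            if name not in fixed_columns
        }
    return table


by_individual = parse_individual_rows(INDIVIDUAL_ROWS)
by_variant = parse_variant_rows(VARIANT_ROWS)
```

Both layouts hold the same genotypes in this sketch, but in either case a reader must parse whole rows to reach a single genotype, which is the sense in which the storage is row-wise.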
Because of the vast amount of genetic data and the various forms in which it is available, processing of the data is very inefficient and resource intensive. Running comparisons across individuals may require that the row-wise flat file data first be extracted from the format in which it is stored, merged with other data or otherwise accessed independently. This extraction must be repeated each time more data is added, some steps must be repeated every time a comparison is made, and the process is memory bound. It is a computationally heavy process and is a mandatory first step, regardless of the complexity of the comparison or analysis. For example, a simple count of allele distribution over a population stratified by some disease could require scanning multiple files and filtering using information derived from another location (for example, on disease status). When comparing two separate SNPs, for example, it could be necessary to scan through the data twice, once for each SNP, or to build a custom query or write code to access a particular row or column, depending on the way the data is represented. The data set may be very large, e.g., possibly 1,000,000 data items per individual, and scanning it is a very time-consuming and resource-intensive process. Although the process is inherently parallel in nature, as the identical operation is being performed across every individual, the row-wise data format is serial and must be scanned repeatedly each time an additional individual or SNP is examined to locate items of interest. This makes it difficult for researchers to conduct exploratory ad-hoc “what if” searches to look for correlations between a variety of different SNPs and different data items.
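The allele-count example above can be sketched as follows, assuming a per-individual genotype file and a separate disease-status file linked by a shared identifier. The sample data and the names `load_status` and `allele_counts_by_status` are hypothetical; the sketch is meant only to show why each question about a SNP triggers another full pass over the row-wise data.

```python
import csv
import io
from collections import Counter

# Hypothetical row-wise genotype data: one row per individual.
GENOTYPES = (
    "id,rs1,rs2\n"
    "ind1,AA,AG\n"
    "ind2,AG,GG\n"
    "ind3,GG,AG\n"
)

# Hypothetical phenotype data held in a separate location.
PHENOTYPES = (
    "id,status\n"
    "ind1,case\n"
    "ind2,control\n"
    "ind3,case\n"
)


def load_status(text):
    """Return {individual: disease status} from the phenotype file."""
    return {row["id"]: row["status"] for row in csv.DictReader(io.StringIO(text))}


def allele_counts_by_status(genotype_text, status_by_id, snp):
    """Count alleles of one SNP per disease status with a full serial scan.

    Returns the counts together with the number of rows that were read.
    """
    counts = {}
    rows_scanned = 0
    for row in csv.DictReader(io.StringIO(genotype_text)):
        rows_scanned += 1
        status = status_by_id[row["id"]]
        # Each genotype string such as "AG" contributes one count per allele.
        counts.setdefault(status, Counter()).update(row[snp])
    return counts, rows_scanned


status_by_id = load_status(PHENOTYPES)
rs1_counts, rs1_rows = allele_counts_by_status(GENOTYPES, status_by_id, "rs1")
rs2_counts, rs2_rows = allele_counts_by_status(GENOTYPES, status_by_id, "rs2")
total_rows_scanned = rs1_rows + rs2_rows
```

In this sketch, examining two SNPs reads every individual's row twice even though only one field per row is needed each time, so the cost of each additional query grows with the full size of the data set.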
It is desirable to provide methods for enhancing the efficiency and reducing the time and resources required for processing genetic data, and it is to these ends that the present invention is directed.