Comparing the genomes of individuals allows for identification of signature patterns of genetic variations between those individuals. In particular, knowledge of the range and types of genetic variations between individuals, or between normal and diseased tissues of a given individual, and especially knowledge of variations that are unique to the individuals or shared between the individuals, is important in understanding disease behavior and disease progression and thus is important for planning therapeutic interventions. For example, FIG. 1A illustrates a pair of exemplary genomes 1 and 2 and a reference genome, which respectively are referred to in FIG. 1A using the shorthand “G1,” “G2,” and “Gref,” along chromosome 1, which is referred to in FIG. 1A using the shorthand “C1.” The genomes each contain a sequence of nucleic acids represented by the letters “a,” “g,” “c,” and “t,” which respectively refer to adenine, guanine, cytosine, and thymine. As can be seen in FIG. 1A, genome 1 (G1, C1) (SEQ ID NO:1) and genome 2 (G2, C1) (SEQ ID NO:2) have substantially the same sequence of nucleic acids as one another along chromosome 1, except at position 5, at which genome 1 has an “a,” while genome 2 has a “c,” and the reference genome has an “a.” Such a variation in a single nucleic acid in genome 2, relative to the reference genome, can be referred to as a single nucleotide polymorphism, or “SNP.” Additionally, in the exemplary sequences illustrated in FIG. 1A, genomes 1 and 2 both have the nucleic acid sequence “atc” at positions 20-22 on chromosome 1. Although the two genomes' sequences are the same, the reference genome (Gref, C1) (SEQ ID NO:3) instead can have the sequence “cat” at this position.
Accordingly, genomes 1 and 2 both can be considered to have a variant at positions 20-22 on chromosome 1, in which the sequence “atc” is substituted for “cat.” By determining the similarities or differences in genetic variants between genomes, e.g., by determining which variants of exemplary genomes 1 and 2 illustrated in FIG. 1A are shared or are unique to the respective genome, and which are the same as or different than a reference genome, it can be possible to deduce the effect that different genetic variations can have on disease, which in turn can be useful in developing a way to treat such disease.
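By way of illustration only, the position-by-position comparison described above can be sketched as follows. The sequences below are hypothetical placeholders and are not SEQ ID NOS: 1-3; only the bases at position 5 and at positions 20-22 follow the description of FIG. 1A, and the function name `find_substitutions` is an assumption made for this sketch. The sketch handles only substitutions between sequences that are already aligned and of equal length.

```python
# Hypothetical sequences: only position 5 and positions 20-22 follow the text;
# all other bases are invented placeholders.
GREF = "gtcaaggcttagcctgatgcatgg"
G1 = "gtcaaggcttagcctgatgatcgg"
G2 = "gtcacggcttagcctgatgatcgg"


def find_substitutions(reference, sample):
    """Return (1-based position, reference bases, sample bases) for each
    run of consecutive mismatches between two pre-aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    start = None  # 0-based index where the current mismatch run began
    for i, (r, s) in enumerate(zip(reference, sample)):
        if r != s:
            if start is None:
                start = i
        elif start is not None:
            variants.append((start + 1, reference[start:i], sample[start:i]))
            start = None
    if start is not None:  # a mismatch run that reaches the end of the sequence
        variants.append((start + 1, reference[start:], sample[start:]))
    return variants


print(find_substitutions(GREF, G1))  # [(20, 'cat', 'atc')]
print(find_substitutions(GREF, G2))  # [(5, 'a', 'c'), (20, 'cat', 'atc')]
```

In this sketch, genome 2 yields both the SNP at position 5 and the substitution at positions 20-22, while genome 1 yields only the latter.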
FIGS. 1B-1E illustrate a set of basic building blocks that can be used in the logical analysis of genetic variant data using genomic set theory, and that are intended to illustrate different ways in which two genomes can be compared to one another. Specifically, FIG. 1B illustrates the “union” operation A∪B, defined to mean the set of all items that exist in A, in B, or in both. FIG. 1C illustrates the “differentiate” operation A\B, defined to mean the set of all items that exist in A but not in B. FIG. 1D illustrates the “intersect” operation A∩B, defined to mean the set of all items that exist in both A and B. FIG. 1E illustrates the “symmetric differentiate” operation (A\B)∪(B\A), defined to mean all items that exist in A or in B, but not in both.
Operations such as illustrated in FIGS. 1B-1E can be used to perform logical analysis of sets of genomic data. For example, FIG. 1F illustrates differentiation of genome 1 from genome 2, while FIG. 1G illustrates symmetric differentiation of genome 1 from genome 2. Such operations output variants that are unique either to genome 1 or to genome 2. If these genomes correspond to healthy individuals, the output variants can explain differences in normal phenotypes that lack predisposition to diseases under “ideal” conditions. Alternatively, if genome 1 corresponds to healthy tissue in a given individual and genome 2 corresponds to tumor tissue from that individual, the differentiation of genome 1 from genome 2 using operations such as illustrated in FIGS. 1F and 1G can isolate tumor-specific variants, which can help in identifying “driver” and “passenger” mutations in a tumor, as well as key genes involved in tumor-related processes. In comparison, FIG. 1H illustrates an intersection between genome 1 and genome 2, which outputs variants that are shared between genome 1 and genome 2 and can indicate, for example, conserved areas of genomes or regions of common or shared lineage.
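By way of illustration only, the union, differentiate, intersect, and symmetric differentiate operations can be sketched using Python's built-in set type. The representation of each variant as a (chromosome, position, reference bases, alternate bases) tuple is an assumption made for this sketch; the two variant sets follow the description of FIG. 1A, in which genome 2 carries the SNP at position 5 and both genomes carry the substitution at positions 20-22.

```python
# Each variant is a hypothetical (chromosome, position, ref, alt) tuple.
SNP_POS5 = ("1", 5, "a", "c")
SUB_POS20 = ("1", 20, "cat", "atc")

genome1 = {SUB_POS20}            # variants of genome 1 relative to Gref
genome2 = {SNP_POS5, SUB_POS20}  # variants of genome 2 relative to Gref

union = genome1 | genome2          # items in genome 1, genome 2, or both
only_in_1 = genome1 - genome2      # differentiate: in genome 1 but not genome 2
only_in_2 = genome2 - genome1      # differentiate: in genome 2 but not genome 1
shared = genome1 & genome2         # intersect: in both genomes
unique_either = genome1 ^ genome2  # symmetric differentiate: in exactly one genome

print(only_in_1)      # set()
print(only_in_2)      # {('1', 5, 'a', 'c')}
print(shared)         # {('1', 20, 'cat', 'atc')}
print(unique_either)  # {('1', 5, 'a', 'c')}
```

If genome 1 were taken to represent healthy tissue and genome 2 tumor tissue of the same individual, `only_in_2` would correspond to the tumor-specific variants in this toy example.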
However, note that each individual genome to be compared includes a vast amount of data. For example, each of the 23 chromosomes of a human genome can contain about 48 million to 250 million base pairs, for a total of over 3.2 billion base pairs. Although relatively short sections of a given chromosome can be compared to one another on a manual basis, such as illustrated in FIG. 1A, computer-based approaches to genome comparison are the only practicable way of processing such high volumes of data. In such approaches, each individual's genome can be digitally represented as a series of letters representing nucleic acids such as illustrated in FIG. 1A, and two genomes can be compared to one another using computational algorithms known in the art. For example, the nucleic acid sequences in the digital representations of two genomes can be aligned relative to one another, the letters of the sequence can be compared to one another, and information about the variations in the sequence and positions thereof can be recorded in a suitable digital format, e.g., using a file format known in the art as variant call format, or VCF. A VCF file can include a list of the chromosomes, positions, reference alleles, alternate alleles, and zygosity of genetic variants in a particular genome, among other items of information. For further details on the VCF format, see Danecek et al., “The variant call format and VCFtools,” Bioinformatics 27(15): 2156-2158 (Jun. 7, 2011).
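By way of illustration only, the following sketch reads the chromosome, position, reference allele, and alternate allele from the first five tab-delimited columns of VCF data lines (CHROM, POS, ID, REF, ALT). The two records are hypothetical and are modeled on the variants of genome 2 in FIG. 1A; the names `Variant` and `parse_vcf_records` are assumptions made for this sketch. Zygosity, which is carried in the per-sample genotype columns, is omitted for brevity.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str
    alt: str


def parse_vcf_records(lines: Iterable[str]) -> Set[Variant]:
    """Collect one Variant per alternate allele from VCF data lines."""
    variants = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information, and the header
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # ALT may list several alleles
            variants.add(Variant(chrom, int(pos), ref, allele))
    return variants


# Hypothetical VCF content, not taken from any real data set.
VCF_TEXT = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t5\t.\tA\tC\t.\tPASS\t.\n"
    "1\t20\t.\tCAT\tATC\t.\tPASS\t.\n"
)

genome2_variants = parse_vcf_records(VCF_TEXT.splitlines())
print(sorted(genome2_variants))
```

Sets of such records, one per genome, could then serve as the operands of set operations such as those illustrated in FIGS. 1B-1H.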
However, performing operations such as illustrated in FIGS. 1B-1H based on genomic data or VCF files can be computationally intensive, and can require a relatively large amount of memory to complete within an experimentally useful time frame. For example, on Apr. 9, 2013, IBM Corporation and CLC bio issued a press release announcing that they would offer a “next generation sequencing analytics solution” that includes between 48 and 192 CPU cores and between 192 and 768 GB of memory, and software for analyzing, comparing, and visualizing high-throughput sequencing data. See press release, “IBM and CLC bio deliver combined turnkey genomics sequencing analytics solution” dated Apr. 9, 2013, issued by CLC bio and available online at www.clcbio.com/wp-content/uploads/2013/04/IBM-and-CLC-bio-deliver-combined-turnkey-genomics-sequencing-analytics-solution 1.pdf.
Methods for compressing genomic data into a computationally more manageable size also have been developed. For example, U.S. Pat. No. 7,657,383 to Allard et al. is directed to a method of representing a genome as the set of differences between a subject genome and a reference genome. Specifically, Allard discloses comparing the subject genome to the reference genome, and determining whether a difference has been found. In response to the identification of a difference, a marker is located within the reference genome, and a corresponding marker is located in the subject genome. Allard discloses that the information portions of the sequence around the genetic markers then are compared, and any offset values are associated with, or assigned to, the difference. Allard discloses that a label or indicator, such as the marker number, is assigned to the difference, and that a text description of the difference can be assigned, such as the type of difference, e.g., an addition, deletion, translocation, SNP, or repetitive microsatellite. Allard discloses that the accumulated data, such as indicators or marker numbers, starting and/or ending offsets, translocation information, and/or other information can be stored. Allard discloses that the entire set of descriptive data can specify the subject genome.
However, comparing two genomes along their entire lengths, as disclosed by Allard to generate a set of data specifying a subject genome, is computationally intensive. Moreover, the number of data sets thus generated scales linearly with the number of subject genomes analyzed, thus requiring at least a linear increase in the computational effort to perform such analysis and in the amount of storage space required to store the data sets.
Thus, what is needed is a computationally efficient method of storing and analyzing genomic data.