Relational databases are being considered for storing sequence data characterized by long, repeating sequences that generally consist of a relatively small number of distinct values. However, there are currently no acceptable solutions for efficiently querying such sequences. Because of this, the sequence data is generally either not stored relationally or only small portions of it are stored. The lack of relationally stored sequence data, and the storage of only small portions of it, make it difficult to mine the data. An example of such sequence data is genomic sequence data.
Genomic sequence data is typically represented as long sequences of letters or numbers, each having a small number of distinct values. Storage and query of genomic sequence data is problematic in a relational database because of its size. The entire human genome consists of approximately 3 billion base pairs. A base pair consists of two nucleotides on opposite, complementary DNA or RNA strands connected via hydrogen bonds. Each DNA nucleotide is typically represented by one of the letters A, C, G, and T, which stand for adenine, cytosine, guanine, and thymine, respectively. Sequences are represented by combining the letters of the nucleotides. For example, a small sequence may be represented by AGAATTCA.
Variations in human DNA sequences can affect how an individual develops a disease or reacts to treatment. A single variation in a nucleotide within a DNA fragment is called a Single Nucleotide Polymorphism, or SNP. For example, one individual may have a DNA fragment of AGAATTCA and another individual may have a similar DNA fragment of AGAATCCA. In such a case, the two possible sequences are called alleles and are typically named after the variation, such as a T allele for the first individual and a C allele for the second individual. There are an estimated 5-10 million SNPs in the human genome; however, only 0.1% of the DNA is different from one individual to another.
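The SNP in the example fragments can be located by a position-by-position comparison. The following is a minimal sketch only; the function name `snp_positions` and the zero-based indexing are assumptions made for illustration and are not part of any described system.

```python
def snp_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (zero-based index, base in seq_a, base in seq_b) for each differing position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


first_individual = "AGAATTCA"   # carries the T allele
second_individual = "AGAATCCA"  # carries the C allele

print(snp_positions(first_individual, second_individual))  # [(5, 'T', 'C')]
```

The single differing position is the SNP, and the two bases reported there are the T and C alleles named in the example.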
Humans are diploid organisms, which means they have two copies of every chromosome. Therefore, in humans, there can be three possible combinations of alleles, for example CC, CT, and TT. The combination an individual has for a specific trait is called the individual's genotype. Thus, the possible genotypes with regard to the SNP in the above example would be CC, CT, or TT. A notation using 0, 1, and 2 is also used to represent an individual's genotype, where 0 indicates the major (also known as wild or common type) allele, 1 indicates the minor (also known as mutation) allele, and 2 represents the case where the chromosomes contain different alleles (e.g., CT).
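The 0/1/2 notation can be sketched as a small encoding function. This is an illustration under stated assumptions: the function name `encode_genotype` is hypothetical, 0 and 1 are read as meaning that both chromosomes carry the major or minor allele respectively, and the choice of C as the major allele in the usage lines is arbitrary, since the text does not say which allele is more common.

```python
def encode_genotype(pair: str, major: str, minor: str) -> int:
    """Encode a two-letter genotype such as 'CT' using the 0/1/2 notation.

    0: both chromosomes carry the major allele
    1: both chromosomes carry the minor allele
    2: the chromosomes carry different alleles
    """
    alleles = set(pair)
    if len(pair) != 2 or not alleles <= {major, minor}:
        raise ValueError(f"unexpected genotype {pair!r}")
    if alleles == {major}:
        return 0
    if alleles == {minor}:
        return 1
    return 2


# Assumption for illustration only: C is treated as the major allele.
for genotype in ("CC", "TT", "CT", "TC"):
    print(genotype, encode_genotype(genotype, major="C", minor="T"))
```

Note that CT and TC receive the same code, since the notation records only that the two chromosomes differ.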
An individual's genotype does not specifically identify which SNPs are on which chromosomes. The identification of which SNPs are on each chromosome is the haplotype of an individual. A 0/1 vector is generally used for the haplotype, where a zero indicates the major allele and a one indicates the minor allele. Chip arrays have been developed that can detect the presence of SNPs in a DNA sample. Current chips support detection of up to 900,000 SNPs. The comparison, retrieval, and analysis of sequences, SNPs, genotypes, and haplotypes are all critical to understanding the genetic association of disease and treatment efficacy.
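The relationship between haplotype and genotype can be illustrated with 0/1 vectors. The sketch below is hypothetical: the function name `genotype_from_haplotypes` and the sample vectors are invented for illustration, and the genotype codes follow the 0/1/2 notation in which 2 marks differing alleles.

```python
def genotype_from_haplotypes(hap_a: list[int], hap_b: list[int]) -> list[int]:
    """Combine one 0/1 haplotype vector per chromosome into 0/1/2 genotype codes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotype vectors must have the same length")
    # Matching alleles keep their 0/1 value; differing alleles are coded 2.
    return [a if a == b else 2 for a, b in zip(hap_a, hap_b)]


# Two invented individuals with different haplotypes over three SNPs.
individual_1 = ([0, 1, 0], [1, 0, 0])
individual_2 = ([0, 0, 0], [1, 1, 0])

print(genotype_from_haplotypes(*individual_1))  # [2, 2, 0]
print(genotype_from_haplotypes(*individual_2))  # [2, 2, 0]
```

Both individuals yield the same genotype vector even though their haplotypes differ, which is why the genotype alone does not identify which SNPs are on which chromosome.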
Genomic data is filled with large strings of characters that have a small number of distinct values, which presents a challenge both for storing the data in a relational database and for mining the stored data. In a relational database, data is arranged into rows and columns, with each row generally corresponding to a record and each column generally corresponding to a field of data for each record. Some possible approaches to storing the sequence data include storing the sequences for a patient in a single column, using a column for each element of the sequence (e.g., nucleotide, SNP), or using a row for each element of the sequence for each patient.
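The three storage approaches can be sketched with Python's built-in `sqlite3` module and an in-memory database. All table names, column names, and patient identifiers below are hypothetical and chosen only for illustration; the sequences are the two example fragments, far shorter than real genomic data.

```python
import sqlite3

sequences = {"patient_1": "AGAATTCA", "patient_2": "AGAATCCA"}
length = 8
conn = sqlite3.connect(":memory:")

# Approach 1: the whole sequence in a single column.
conn.execute("CREATE TABLE seq_single (patient_id TEXT PRIMARY KEY, sequence TEXT)")
conn.executemany("INSERT INTO seq_single VALUES (?, ?)", list(sequences.items()))

# Approach 2: one column per sequence element (p0 .. p7).
cols = ", ".join(f"p{i} TEXT" for i in range(length))
conn.execute(f"CREATE TABLE seq_wide (patient_id TEXT PRIMARY KEY, {cols})")
marks = ", ".join("?" for _ in range(length + 1))
conn.executemany(
    f"INSERT INTO seq_wide VALUES ({marks})",
    [(pid, *seq) for pid, seq in sequences.items()],
)

# Approach 3: one row per sequence element per patient.
conn.execute(
    "CREATE TABLE seq_long (patient_id TEXT, position INTEGER, base TEXT, "
    "PRIMARY KEY (patient_id, position))"
)
conn.executemany(
    "INSERT INTO seq_long VALUES (?, ?, ?)",
    [(pid, i, b) for pid, seq in sequences.items() for i, b in enumerate(seq)],
)


def patients(sql: str) -> list[str]:
    """Run a query and return the patient ids it selects."""
    return [row[0] for row in conn.execute(sql)]


# Same question in each layout: who has a C at zero-based position 5?
# (SQLite's substr() is one-based, hence the 6.)
via_single = patients("SELECT patient_id FROM seq_single WHERE substr(sequence, 6, 1) = 'C'")
via_wide = patients("SELECT patient_id FROM seq_wide WHERE p5 = 'C'")
via_long = patients("SELECT patient_id FROM seq_long WHERE position = 5 AND base = 'C'")
long_rows = conn.execute("SELECT COUNT(*) FROM seq_long").fetchone()[0]

print(via_single, via_wide, via_long, long_rows)
```

In this sketch the single-column layout must extract a substring from each stored string, the per-column layout needs one column for every sequence element, and the per-row layout already holds 16 rows for two eight-letter fragments.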
A problem with the first approach is that the database sees the data as one large string, which introduces inefficiencies when trying to quickly identify specific variations within a sequence. Some databases have the ability to define arrays in the database; however, the individual elements cannot be indexed. The second and third approaches enable the database system to easily navigate the data; however, the second approach generates more columns than contemporary relational database systems can typically support, and the third approach results in inefficient processing of the table due to the large number of rows. Moreover, bioinformatics and computational biology typically generate data as sequences, and breaking the sequences apart into columns or rows is not consistent with efficient data generation. These fields involve the use of techniques including applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems, usually on the molecular level.
What is needed, therefore, is an efficient index design for a column that can be used by database queries to more efficiently search sequence type data stored in relational databases.