A single-nucleotide polymorphism, or SNP, is a single-nucleotide difference between DNA sequences from individuals of the same species or between the paired chromosomes in an individual. For example, a segment of DNA may include the nucleotide sequence TTTCTTGTA in one individual (or on the first paired chromosome), and the corresponding segment of DNA may include the nucleotide sequence TTCCTTGTA in another individual (or on the second paired chromosome). Each of these different sequences is called an allele.
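The two example alleles above differ at exactly one position. A minimal sketch of that comparison follows; the function name is hypothetical, and it assumes the two segments are already aligned and of equal length.

```python
def find_single_base_differences(seq_a, seq_b):
    """Return (position, base_a, base_b) for every mismatch, with 1-based positions.

    Assumption: both sequences are already aligned and have the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, base_a, base_b)
        for position, (base_a, base_b) in enumerate(zip(seq_a, seq_b), start=1)
        if base_a != base_b
    ]


# The two alleles from the example in the text.
print(find_single_base_differences("TTTCTTGTA", "TTCCTTGTA"))  # [(3, 'T', 'C')]
```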
Many SNPs are not harmful. Most SNPs are found between genes (i.e., in intergenic regions) or in the non-coding regions of genes (e.g., in the introns). These non-coding SNPs are useful in DNA fingerprinting technologies.
Even when SNPs occur in the coding regions of genes, the SNP may be synonymous with the wild-type gene, and thus the SNP may not affect the amino acid sequence ultimately translated. For example, TTT and TTC both code for the amino acid phenylalanine.
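The synonymous case can be illustrated with a lookup over a small excerpt of the standard genetic code. This is only a sketch: the table below is deliberately partial (four codons), and the names are hypothetical.

```python
# Partial excerpt of the standard genetic code (DNA coding-strand codons).
# Only four codons are listed; a full table has 64 entries.
CODON_TO_AMINO_ACID = {
    "TTT": "Phe",  # phenylalanine
    "TTC": "Phe",  # phenylalanine
    "TTA": "Leu",  # leucine
    "TTG": "Leu",  # leucine
}


def is_synonymous(codon_a, codon_b):
    """Return True when both codons encode the same amino acid."""
    return CODON_TO_AMINO_ACID[codon_a] == CODON_TO_AMINO_ACID[codon_b]


print(is_synonymous("TTT", "TTC"))  # True: both encode phenylalanine
print(is_synonymous("TTC", "TTA"))  # False: phenylalanine vs. leucine
```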
The genetic variation provided by coding-region SNPs leads to the normal variation in phenotype in a given species. The alleles are what give people different genetic traits, such as blonde, brunette, red, or black hair. Some coding-region SNPs, however, can cause genetically linked diseases or disorders. Because some illnesses can be traced back to SNPs, geneticists have been interested in mapping and detecting SNPs.
A mutation in one gene is enough to cause some diseases, such as Huntington's disease and polycystic kidney disease 1 and 2. More often, however, multiple SNPs are involved in causing complex disorders like asthma, cancers, diabetes, heart disease, and many others. In these complex disorders, the existence of one or more SNPs may act as an indicator that a person is at a higher risk of developing the disorder. SNPs are also associated with the metabolism of drugs, giving rise to the possibility of individualized medicine where treatment is provided to an individual depending upon his or her genetic make-up.
SNPs are detected in a number of ways. For example, one method uses SNP chips, which are small silicon glass wafers with single-stranded DNA fragments attached. Each attached single-stranded DNA fragment has a unique sequence that corresponds to a known SNP. A sample of the DNA is converted into single-stranded DNA, and a fluorescent dye label is added. The labeled sample DNA fragments are incubated on the chip, and any labeled sample DNA with a nucleotide sequence matching a known SNP hybridizes to the known SNP bound on the chip. The DNA that does not bind is washed away, and then a computer scans the chip to detect the location of the fluorescent labels, thereby detecting sample DNA bound to the DNA with known SNPs and, thus, identifying the SNPs in the DNA sample. This procedure is time-consuming, however, and only known SNPs are detected.
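The chip procedure can be modeled, very loosely, as a lookup against a fixed set of probes. The sketch below assumes that a sample fragment binds a probe only when it is the probe's exact reverse complement, which ignores partial hybridization, signal intensity, and every other wet-lab detail. The probe names and sequences are hypothetical.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the strand that would pair with `sequence`."""
    return sequence.translate(COMPLEMENT)[::-1]


def detect_known_snps(probes, sample_fragments):
    """Return the names of probes bound by at least one sample fragment.

    Assumption: binding requires an exact reverse-complement match.
    """
    fragments = set(sample_fragments)
    return sorted(
        name for name, probe in probes.items()
        if reverse_complement(probe) in fragments
    )


# Hypothetical probes for the two alleles used earlier in the text.
probes = {"allele_T": "TTTCTTGTA", "allele_C": "TTCCTTGTA"}
sample = ["TACAAGGAA", "GGGGGGGGG"]
print(detect_known_snps(probes, sample))  # ['allele_C']
```

As in the procedure described above, a fragment carrying a SNP that has no probe on the chip is simply not reported.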
Related to the detection of SNPs is the sequencing of DNA. To develop the set of known SNPs, one must first sequence DNA to act as the reference for unknown samples. SNP chips are a viable method for identifying SNPs because the human genome and other genomes have been sequenced in their entirety. By comparing several genomes within the same species and/or the same gene from several genomes, consensus sequences are created, and variations from the consensus sequences are identified as SNPs.
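One simple way to picture the consensus step is a per-position majority vote over aligned sequences. The sketch below assumes equal-length, pre-aligned inputs with no insertions or deletions; the function names and sample data are hypothetical, and real consensus building is considerably more involved.

```python
from collections import Counter


def consensus_sequence(aligned_sequences):
    """Return the most common base at each position.

    Assumption: all sequences are pre-aligned and of equal length.
    """
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_sequences)
    )


def variants_from_consensus(aligned_sequences, consensus):
    """List (sample_index, position, consensus_base, observed_base), positions 1-based."""
    variants = []
    for sample_index, sequence in enumerate(aligned_sequences):
        for position, (base, expected) in enumerate(zip(sequence, consensus), start=1):
            if base != expected:
                variants.append((sample_index, position, expected, base))
    return variants


samples = ["TTTCTTGTA", "TTTCTTGTA", "TTCCTTGTA"]
consensus = consensus_sequence(samples)
print(consensus)  # TTTCTTGTA
print(variants_from_consensus(samples, consensus))  # [(2, 3, 'T', 'C')]
```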
Shotgun sequencing is a commonly used method for sequencing entire genomes. In shotgun sequencing, DNA is fragmented into random segments. These segments are sequenced, and the determined sequences of nucleic acid fragments are called “reads.” The fragmenting process generates overlapping reads, which are aligned based on their overlapping regions.
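The idea of aligning reads by their overlapping regions can be sketched for a single pair of reads. This is a toy illustration that assumes error-free reads and exact suffix-to-prefix overlap; the function names and the `min_overlap` parameter are hypothetical, and real assemblers must tolerate sequencing errors and repeats.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


print(overlap_length("TTTCTTG", "CTTGTA"))  # 4 (shared "CTTG")
print(merge_reads("TTTCTTG", "CTTGTA"))     # TTTCTTGTA
```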
Even though sequence alignments are done by computer, sequencing is still a time-consuming process. Bowtie, a software program for aligning sequences, claims to be able to align 25 million 35-base-pair reads per hour. Bowtie also builds an index for a genome based on the Burrows-Wheeler transform. Using the Bowtie program to build an index for a human genome, which includes approximately 3 billion base pairs, would take over 8 hours. Furthermore, detecting known human SNPs using the SNP chip method may require hours to prepare and process the chips, as described above, and novel SNPs cannot be detected with SNP chips.
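For readers unfamiliar with the Burrows-Wheeler transform that underlies such an index, the following naive sketch shows the transform and its inverse on a short string. It is not Bowtie's implementation: sorting every rotation is only practical for toy inputs, and the `$` end marker and function names are conventions chosen for this illustration.

```python
def burrows_wheeler_transform(text, end_marker="$"):
    """Return the last column of the sorted rotations of text + end_marker.

    Assumption: end_marker does not occur in text and sorts before A, C, G, T.
    """
    text = text + end_marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_burrows_wheeler(last_column, end_marker="$"):
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


transformed = burrows_wheeler_transform("GATTACA")
print(transformed)                           # ACTGA$TA
print(inverse_burrows_wheeler(transformed))  # GATTACA
```

The transform tends to group identical characters together, which is what makes compressed, searchable indexes possible; the cost of building one over billions of bases is what the time estimate above refers to.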
Related to the field of sequencing DNA is the study of metagenomics. Metagenomics is the study of the myriad of genomes obtained directly from the environment, which is especially important in the study of microorganisms that cannot be cultured or easily studied in the lab. Metagenomics is used to understand the genetic diversity in an environment. In metagenomics, all of the genetic material of an environmental sample may be studied as a whole without first separating and identifying the genetic material with a particular species. One aspect of metagenomics research, however, is focused on determining which species are present in undifferentiated samples by sequencing DNA in the sample and comparing it to known DNA sequences. DNA sequencing in metagenomics is also used to discover previously unknown species when the sequencing reveals a novel genome. Often, the novel genome can be categorized by genus, even if the species has never been identified before.
Metagenomics also involves developing a way to determine whether a particular species is in a sample containing DNA from several species. One method of determining the species in a sample may involve sequencing the genetic material from a sample and then comparing the sequences to libraries of known sequences to determine which species are present. Often, the sequences are not resolved into entire genomes before comparing them to known sequences. Instead, the sequence “reads” are compared to sequence libraries to determine the percentage of sample sequences that match species' sequences in the library. The higher the percentage match for a certain species, the more likely that DNA from that species is present in the sample. Given the current sequence analysis technologies, this is a time-intensive task.
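The percentage-match comparison can be sketched as follows. The example assumes that a read "matches" a species when it appears as an exact substring of that species' reference sequence; the species names, reference sequences, and reads are all hypothetical, and practical tools use indexed, error-tolerant matching instead.

```python
def percent_reads_matching(reads, library):
    """Return {species: percentage of reads found in that species' reference}.

    Assumption: a match is an exact substring hit in the reference sequence.
    """
    percentages = {}
    for species, reference in library.items():
        matched = sum(1 for read in reads if read in reference)
        percentages[species] = 100.0 * matched / len(reads) if reads else 0.0
    return percentages


# Hypothetical reference library and sample reads.
library = {
    "species_a": "TTTCTTGTAACG",
    "species_b": "GGGCCCAAATTT",
}
reads = ["TTCT", "TGTA", "CCCA", "ACGT"]
print(percent_reads_matching(reads, library))
# {'species_a': 50.0, 'species_b': 25.0}
```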
In light of the above, there is a need for faster methods of analyzing sequenced DNA and detecting SNPs. There is also a need for faster ways to sequence and identify multiple genomes from environmental samples and to detect the presence of a single genome in samples containing many genomes. Finally, there exists a need for simultaneously sequencing DNA and comparing the resulting sequence, such that SNPs can be identified before the entire genome is sequenced and a genome or sequence can be identified before the sequencing is complete.