Massively Parallel Sequencing (MPS) approaches such as those now in wide commercial use (Illumina/Solexa, Roche/454 Pyrosequencing, and ABI SOLiD) are attractive tools for sequencing. Typically, MPS methods can only obtain short read lengths (hundreds of base pairs, bp, also called nucleotides, nt, with Illumina platforms, to a maximum of 200-300 nt by 454 Pyrosequencing) but perform many thousands to millions of such short reads on the order of hours. Sanger methods, on the other hand, achieve longer read lengths of approximately 800 nt (typically 500-600 nt with non-enriched DNA) but take several times longer to do so.
While sequencing machines were originally created for the purposes of sequencing genomic DNA, they have since been put to a myriad of other uses. Considering a sequencer simply as a device for recording the count of specific DNA sequences, sequence census experiments utilize high-throughput sequencing to estimate abundances of “target sequences” (also called “reference sequences”) for molecular biology and biomedical applications. Unusual populations of certain reference sequences can be diagnostic of disease.
To compare the DNA of the sequenced sample to its reference sequence, current methods are designed to find the corresponding part of that sequence for each read in the output sequencing data. This step is called aligning or mapping the reads against the reference sequence. Once this is done, one can look for one or more variations (e.g., a single nucleotide polymorphism, SNP, or a copy number variation, CNV, or a structural variation like presence/absence variation, PAV, or multiples or combinations thereof) within the sample. Aligning the read to the reference consumes a considerable amount of computing power.
For example, Sehnert et al 2011 and Biananchi et al 2014 describe methods to identify aneuploidy in a fetus from maternal blood samples, thus avoiding expensive and dangerous invasive procedures. Aneuploidy is a condition in which the number of chromosomes in the nucleus of a cell is not an exact multiple of the monoploid number of a particular species. An extra or missing chromosome is a common cause of genetic disorders including human birth defects. The fetal DNA in maternal blood is a very small portion of the sample (e.g., less than 10% and often as little as 0.5%) and the identification of its sequences is thus subject to systematic and random errors in the sample preparation, sequencing and alignment processes.
Similarly, cancerous tumors may have CNVs, PNVs, other structural mutations, or express different genes than the populations of normal cells in an individual. The tumor DNA in a patient tissue sample is a relatively small portion of the sample (e.g., less than 15% and sometimes as little as 0.5%) and the identification of its sequences is likewise subject to systematic bias and random errors in the sample preparation, sequencing and alignment processes.