1. Field
The present teachings generally relate to nucleic acid analysis and, in various embodiments, to a system and methods for sequence data processing and consensus sequence analysis.
2. Description of the Related Art
Advances in automated nucleic acid sequence analysis have revolutionized the fields of cellular and molecular biology. As a result, it is now feasible to sequence whole genomes, as evidenced by the completed sequencing of the 3-billion-base human genome. When using automated systems, it is important to maintain a high degree of accuracy with respect to the identification of individual nucleotide bases. Oftentimes, base identification is predicated upon raw electrophoretic and/or chromatographic data, which are resolved to identify each base within a sequence undergoing analysis. Numerous factors may affect this analysis including, for example, the base composition of the sequence, experimental and systematic noise, migration anomalies (compressions and stretches), variations in observed signal strength for the detected bases, and variations in reaction efficiencies. The presence of mixed-bases in a sample may present further difficulties, as conventional systems may be unable to properly resolve and identify them. Mixed-bases may be representative of sequence variants contained within a sample and may arise from allelic variation or genetic heterozygosity. Mixed-bases may also represent regions within a sample sequence where more than one putative base can be identified. Conventional systems may overlook or erroneously identify these regions, thereby degrading the accuracy of the sequence analysis. Accordingly, there is a need for an improved methodology by which mixed-bases can be identified and evaluated.