1. Field of the Invention
Embodiments disclosed herein relate to a method and system for determining the accuracy of DNA base identifications, based at least partly on sampling characteristics of subsets within training data sets.
2. Description of the Related Art
With the progress of the Human Genome Project and its massive undertaking to sequence the entire human genome, researchers have been turning to automated DNA sequencers to process vast amounts of DNA sequence information. DNA, or deoxyribonucleic acid, is one of the most important information-carrying molecules in cells. DNA is composed of four different types of monomers, called nucleotides, each of which consists of a base linked to a sugar and a phosphate group. The four bases are adenine (A), cytosine (C), guanine (G), and thymine (T). In its native state, a DNA fragment is a double helix of two antiparallel chains with complementary nucleotide sequences. The coded information of a DNA sequence is determined by the order of the four bases in either of these chains. This sequence of bases is often referred to as the nucleotide sequence or nucleic acid sequence of the DNA. Several chemical methods have been developed for detecting and identifying the bases in order, and such methods can be performed on automated equipment. However, the reliability of such base predictions may be limited by the performance of the equipment and the particular chemistry being used. Moreover, the accuracy of determining, or “calling,” a base may vary between separate sequencing experiments, or even from base to base. Thus, there is a need for predicting the bases within a DNA sequence and for assessing a quality measure associated with each prediction.