The advent of next generation sequencing (NGS) and the falling cost of DNA sequencing have made large-scale human genome sequencing possible for research in medical genetics and population genetics. NGS sequencers can produce several billion very short fragment sequences (referred to as “reads”). The genome sequence of an individual is obtained through resequencing (including indexing, mapping and alignment), in which the locations of the generated reads in a reference sequence are determined.
To accurately map reads when analyzing a base sequence, a reference sequence is often used. However, for various reasons (e.g., a sequencing error, a sampling error, or a test error), a reference sequence may contain one or more bases of uncertain identity, that is, bases for which it is not known whether they are A, C, G or T. Such unidentified bases are generally denoted by a separate letter, such as “N.” To process unidentified bases, conventional systems for analyzing a base sequence either treat each unidentified base as potentially being any of A, C, G and T, or predict the identity of the unidentified base using, for example, a probabilistic method. However, because of the additional processing required for the unidentified bases, these conventional systems analyze a base sequence at a considerably reduced speed and/or with a reduced degree of accuracy.
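The cost of treating each unidentified base as any of A, C, G and T can be illustrated with a minimal Python sketch. It is not taken from any particular conventional system; the function names `expand_unknown_bases` and `count_expansions` are hypothetical and chosen only for illustration. The sketch enumerates every concrete sequence that a reference fragment containing “N” could represent, which shows that the number of candidates grows as 4 raised to the number of unidentified bases.

```python
from itertools import product

BASES = "ACGT"  # the four possible identities of an unidentified base


def expand_unknown_bases(seq):
    """Yield every concrete sequence obtained by replacing each 'N' in seq
    with one of A, C, G, or T (hypothetical helper, for illustration only)."""
    # An 'N' position offers four choices; a known base offers only itself.
    choices = [BASES if base == "N" else base for base in seq]
    for combo in product(*choices):
        yield "".join(combo)


def count_expansions(seq):
    """Return the number of candidate sequences without enumerating them."""
    return 4 ** seq.count("N")


if __name__ == "__main__":
    fragment = "ANGN"  # illustrative reference fragment with two unidentified bases
    candidates = list(expand_unknown_bases(fragment))
    print(len(candidates), "candidates, for example:", candidates[:4])
```

Under this naive approach, a fragment with only ten unidentified bases already corresponds to 4^10 = 1,048,576 candidate sequences, which is consistent with the reduction in analysis speed described above.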