Genetic studies have seen rapid advances in recent years. The entire genomes of specific organisms, including some individual human beings, have been sequenced and made available as references. In genetic research, genetic testing, personalized medicine, and many other applications, it is often useful to obtain a sample of genetic material, determine a sequence of that sample, and map that sample sequence to a location on an available reference. Once the mapping is done, the sample sequence can be compared with the reference in order to identify polymorphisms or mutations or to obtain other useful information.
Existing approaches typically map long, contiguous sample sequences to locations in a reference. However, some techniques used for obtaining sample sequences yield data sets comprising short sequences (sometimes referred to as oligomers) with predicted spatial relationships. Such ‘polyoligomer data sets’ consist of multiple oligomers separated by variable but constrained amounts of spacing or overlap (referred to as separation distance). Where individual oligomers are too short to be mapped to one or a small number of possible locations on a reference, and the spacing between oligomers is variable, existing approaches are not adequate.
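By way of illustration only, the following minimal sketch shows one way a polyoligomer data set might be represented, and why a single short oligomer can be ambiguous when considered alone. The class name `PolyOligomerRead`, the helper `find_all`, the toy reference string, and the convention that a negative separation distance denotes overlap are all assumptions made for this example; none of them is taken from any particular sequencing platform or mapping tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PolyOligomerRead:
    """Hypothetical container for one polyoligomer data set."""
    # Short sequences, in their predicted order along the sample.
    oligomers: List[str]
    # One (min, max) separation distance per adjacent pair of oligomers.
    # Assumed convention: a negative value means the oligomers overlap.
    gap_ranges: List[Tuple[int, int]]


def find_all(reference: str, oligomer: str) -> List[int]:
    """Return every 0-based start position of an exact match in the reference."""
    positions = []
    start = reference.find(oligomer)
    while start != -1:
        positions.append(start)
        # Advance by one so that overlapping matches are also found.
        start = reference.find(oligomer, start + 1)
    return positions


# Toy reference, far shorter than any real genome.
reference = "ACGTACGTTTACGTAACGT"

# Two oligomers with an assumed separation distance of 0 to 3 bases.
read = PolyOligomerRead(oligomers=["ACGT", "TTAC"], gap_ranges=[(0, 3)])

# Each oligomer is looked up independently of the others.
hits_per_oligomer = [find_all(reference, o) for o in read.oligomers]
```

In this toy case the first oligomer matches the reference at four positions, so it cannot be placed on its own; any placement has to take the other oligomers and the allowed separation distances into account.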
It would be useful to have a way of accurately mapping relatively short oligomer sequences with variable separation distances to a reference in a manner that is robust to data errors, mutations, and polymorphisms and that also identifies them. It would also be desirable for such mapping to be efficient in terms of both computational speed and cost.