Genetic studies have seen rapid advances in recent years. The entire genomes of specific organisms, including some individual human beings, have been sequenced and have become available as references. In genetic research, genetic testing, personalized medicine, and many other applications, it is often useful to obtain a sample of genetic material, determine a sequence of that sample, and map that sample sequence to a location on an available reference. Once the mapping is done, the sample sequence can be compared to the reference in order to identify polymorphisms or mutations or to obtain other useful information.
Existing approaches typically map long, contiguous sample sequences to locations in a reference. However, some techniques used for obtaining sample sequences yield data sets comprising short sequences (sometimes referred to as oligomers) with predicted spatial relationships. Such ‘polyoligomer data sets’ consist of multiple oligomers separated by variable but constrained amounts of spacing or overlap (referred to as separation distance). Where individual oligomers are too short to be mapped unambiguously to one, or a small number of, possible locations on a reference sequence, and the spacing between oligomers is variable, existing approaches are not adequate.
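The difficulty described above can be illustrated with a minimal brute-force sketch in Python. It is not the indexing approach this document is directed to; it only shows why a short oligomer alone is ambiguous and how a separation-distance constraint narrows or widens the set of candidate placements. The function names, the toy reference string, and the convention that separation distance is measured from the end of one oligomer to the start of the next (negative values meaning overlap) are all assumptions made for illustration.

```python
from typing import List, Tuple


def find_all(reference: str, oligo: str) -> List[int]:
    """Return every 0-based start position of `oligo` in `reference`."""
    hits = []
    pos = reference.find(oligo)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(oligo, pos + 1)  # step by 1 to keep overlapping hits
    return hits


def map_polyoligomer(
    reference: str,
    oligos: List[str],
    gap_ranges: List[Tuple[int, int]],
) -> List[Tuple[int, ...]]:
    """Enumerate placements of all oligomers that satisfy the gap constraints.

    gap_ranges[i] is the assumed (min, max) separation distance between the
    end of oligos[i] and the start of oligos[i + 1]; negative means overlap.
    """
    if len(gap_ranges) != len(oligos) - 1:
        raise ValueError("need exactly one gap range per adjacent oligomer pair")

    placements = [(start,) for start in find_all(reference, oligos[0])]
    for i in range(1, len(oligos)):
        candidates = find_all(reference, oligos[i])
        lo, hi = gap_ranges[i - 1]
        prev_len = len(oligos[i - 1])
        extended = []
        for placement in placements:
            prev_end = placement[-1] + prev_len
            for start in candidates:
                if lo <= start - prev_end <= hi:
                    extended.append(placement + (start,))
        placements = extended
    return placements


# Toy reference (hypothetical data, chosen so each 3-mer occurs three times).
REFERENCE = "ACGAATTGCCACGTTTTTGAACGCATTG"

print(find_all(REFERENCE, "ACG"))  # [0, 10, 20]
print(find_all(REFERENCE, "TTG"))  # [5, 16, 25]
print(map_polyoligomer(REFERENCE, ["ACG", "TTG"], [(3, 4)]))  # [(10, 16)]
print(map_polyoligomer(REFERENCE, ["ACG", "TTG"], [(2, 3)]))  # [(0, 5), (10, 16), (20, 25)]
```

In this toy case each oligomer alone matches three locations. A separation distance of 3 to 4 bases leaves a single placement, while a range of 2 to 3 bases leaves all three. Scanning the reference for every oligomer in this way scales poorly on a genome-sized reference, which is the motivation for a dedicated index.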
It would be useful to have indexes that make it possible to accurately map relatively short oligomer sequences with variable separation distances to a reference. It would also be desirable to create indexes that allow such mapping to be efficient in terms of both computational speed and cost.