1. Field of the Invention
The present invention generally relates to a method of detecting atypical sequence fragments contained in longer one-dimensional sequences composed of letters from a fixed alphabet. More specifically, a method of representing sequences and sequence fragments, based on an experimentally determined optimal range of pattern template lengths, in combination with a correlation function and a threshold calculation, improves upon conventional methods for carrying out this task. For clarity, we develop and discuss our method for the case where the sequences are genomic sequences, and in particular DNA; in this case, atypical sequences are generally thought to be the result of horizontal transfer events. The described method is nonetheless applicable in a more general context, as would be immediately apparent to one skilled in the art.
2. Description of the Related Art
As already mentioned above, the system and method that we describe below are generally applicable to the case where a long sequence composed of letters from a given alphabet contains a fragment which is “atypical,” i.e. unlike the remainder of the sequence. For the purpose of presenting a clear description of the approach, we have chosen to develop the discussion for the specific case where the sequences at hand are genetic sequences, and in particular DNA; therein we seek to locate atypical sequence fragments. Generally, atypical sequence fragments overlap with regions defining genes, but this is not a requirement.
Let us now continue our discussion in the context of genomic sequences and complete genomes. In recent years, the genomes of an increasing number of organisms have been experimentally determined. Through study of these genomic sequences and with the help of other types of analyses, it has been discovered that organisms can acquire genetic sequences from organisms that are not necessarily related to them, through a process that has been termed “horizontal gene transfer” (HGT) or “lateral gene transfer” (LGT). In the discussion that follows, the terms “horizontal gene transfer” and “lateral gene transfer” and the abbreviations HGT and LGT will be used interchangeably.
An exemplary problem addressed by the present invention is that of determining whether an observed genetic sequence from an organism is native to that organism's genome or represents an “atypicality” acquired through the HGT process. And as already mentioned, this exemplary problem is a special case of the problem of identifying atypical sequence fragments within longer sequences.
Before proceeding with a discussion of the present invention, it is noted that a genome can be thought of as a sequence of nucleotides; among other things, a genomic sequence contains the definitions of the corresponding organism's genes. The above-mentioned exemplary problem, therefore, can be described as the process of deciding, for a given segment of a given genetic sequence, referred to as the “query,” whether the segment is native to the organism's genome or is the result of a transfer event. The query could be distinct from a gene coding region, partially overlapping a gene coding region, or wholly contained within a gene coding region.
This scenario is analogous to the problem of determining which words, if any, from a given sequence of natural language words have actually originated in another language. It is noted that this ‘donor’ language is not necessarily known. The sought determination is to be made by looking at the words alone, without knowing the meaning of each word and without having a dictionary upon which to rely in order to make the decision. Analogously, in the specific problem under consideration that we use to develop our method, and since genomic sequences are available for only a relatively small number of organisms, one cannot rely on the availability of a repository of reference sequences. In general, it is entirely possible that a given genetic sequence has been transferred horizontally from a donor organism whose genome has not yet been sequenced.
Techniques for determining whether a genetic sequence of a given size is atypical are known and will be discussed shortly. However, there is a growing need for new methods that exhibit an increased sensitivity in detecting genetic atypicalities while at the same time reducing the number of false positive predictions.
The present invention addresses the problem of characterizing a sequence fragment of a given sequence in terms of its atypicality. We develop the invention for the concrete case where the given sequence is a genome sequence. Recall that for a given organism, those genomic fragments whose origin can be traced to an exogenous source, with high probability, are ideal candidates for being atypical and putative instances of horizontal gene transfer (HGT).
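By way of illustration only, the general idea of scoring a fragment against the sequence that contains it can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed method: contiguous k-letter words stand in for the pattern templates, Pearson correlation stands in for the correlation function, and the cutoff is taken as the mean window score minus z standard deviations. The function names and the default values of k, window, step and z are hypothetical and are not taken from the present description.

```python
from collections import Counter
from itertools import product
from math import sqrt
from statistics import mean, pstdev


def kmer_profile(seq, k):
    """Relative frequency of each possible DNA k-mer in seq (overlapping words).

    Words containing letters other than A, C, G, T are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def atypical_windows(genome, k=2, window=200, step=100, z=2.0):
    """Return (start, end) windows whose profile correlates poorly with the whole.

    Assumed scoring rule (illustrative): a window is flagged when its
    correlation with the genome-wide profile falls below
    mean(scores) - z * stdev(scores).
    """
    reference = kmer_profile(genome, k)
    starts = list(range(0, len(genome) - window + 1, step))
    scores = [pearson(kmer_profile(genome[s:s + window], k), reference)
              for s in starts]
    if not scores:  # genome shorter than one window
        return []
    cutoff = mean(scores) - z * pstdev(scores)
    return [(s, s + window) for s, sc in zip(starts, scores) if sc < cutoff]
```

In this sketch the whole sequence serves as its own reference, so no repository of donor sequences is needed, which mirrors the constraint noted above. A fragment is reported only relative to the other windows of the same sequence; the choice of k and of the cutoff rule would in practice have to be determined experimentally.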