The present invention generally relates to data sequence processing methodologies and, more particularly, to methods and apparatus for detecting basic repeating subsequences within the data sequences.
All living organisms carry the information that allows them to live in the form of long molecules called DNA (deoxyribonucleic acid). The structure of these molecules can be idealized as a long chain of four types of chemical components (called nucleotides or bases): adenines, thymines, cytosines and guanines. It is customary to denote each nucleotide by the first letter of its name. Thus, we have four types of nucleotides: A, T, C, and G. We can say that the set {A,T,C,G} is an alphabet (nothing less than the alphabet of life) in the sense that the order of these letters in the DNA is the genetic text that carries the information needed to build each of our proteins. Proteins are the actual molecules that support and connect the cells of living organisms, mediate communication with the environment and perform most of the chemical reactions needed for living. Proteins are also chain-like molecules, built out of a set of twenty constituents called amino acids. These twenty amino acids can be conceptualized as an alphabet and the proteins as sentences in this alphabet. An elementary, yet deep, discussion of these alphabets can be found in B. Lewin, “Genes VI,” Oxford University Press, 1997. The set of all ordered DNA letters constituting the DNA of an organism is called its genome. The set of all sequences of ordered amino acids (the protein alphabet) of an organism is called its proteome.
Most genomes contain regions that consist of a basic sequence that repeats itself (albeit with slight variations), head to tail, a number of times. These arrangements are called tandem repeats (TR). Both the length of the basic sequence that forms the TR, hereafter called the period of the repeat, and the number of repetitions, hereafter called the copy number, vary from TR to TR within the same genome. The overall amount of repeated genomic material is different in different genomes, ranging from approximately 10% in mammalian genomes to 50% in some arthropods, see, e.g., B. Lewin, “Genes VI,” Oxford University Press, 1997.
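The terms period and copy number can be illustrated with a minimal Python sketch. It handles only the idealized case of exact head-to-tail copies, whereas the repeats discussed here tolerate slight variations; the function name describe_exact_tandem_repeat is hypothetical and chosen for illustration only.

```python
def describe_exact_tandem_repeat(region: str) -> tuple[int, int]:
    """Return (period, copy_number) for a region made of exact head-to-tail copies.

    The period is the smallest unit length p such that the region equals its
    first p letters repeated; the copy number is len(region) // p.
    Assumption: exact repetition only (no mutations, insertions or deletions).
    """
    n = len(region)
    for period in range(1, n + 1):
        # The unit must tile the region exactly.
        if n % period == 0 and region[:period] * (n // period) == region:
            return period, n // period
    raise ValueError("region must not be empty")
```

For example, the region ACGACGACG has period 3 and copy number 3, while a region with no internal repetition, such as ACGT, is reported with a period equal to its own length and a copy number of 1.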
Even though the functional origin of the repeated sequences is still a matter of debate, see, e.g., B. Charlesworth, P. Sniegowski and W. Stephan, “The Evolutionary dynamics of repetitive DNA in Eukaryotes,” Nature, 371, 215-220, 1994 (the prevailing conjecture being that short repeats might have a structural function), it is clear that their very presence has important consequences during chromosome pairing, allowing for misalignments to occur, and thus causing repeat clusters to be highly polymorphic, see, e.g., B. Lewin, “Genes VI,” Oxford University Press, 1997. This polymorphism is used in the forensic technique known as DNA fingerprinting, see, e.g., K. Inman and N. Rudin, “An Introduction to Forensic DNA Analysis,” CRC Press, Boca Raton, Fla., 1997. Longer repeats tend to have a very complex and hierarchical internal structure, whose details might give hints about DNA molecular mechanisms such as misaligned recombination and saltatory replication. For a more extensive discussion of different biological aspects of TR, see, e.g., B. Lewin, “Genes VI,” Oxford University Press, 1997; and S. B. Primrose, “Principles of Genome Analysis,” Blackwell Science, Cambridge, Mass., 1995.
Although these surprising repetitive structures were discovered about 25 years ago, a detailed characterization of the main parameters of the repeats, such as the distribution of lengths, the distribution of periods, the phylogeny of their hierarchical structure, how often they belong to coding as opposed to non-coding regions, etc., is still missing. The reasons for this are twofold. On the one hand, enough data to address these questions became available only recently. On the other hand, and in part due to the first reason, fast algorithms to interrogate the genomes came into existence only recently.
The first attempts to systematically find repeats in sequences dealt with exact repeats, see, e.g., R. M. Karp, R. E. Miller and A. L. Rosenberg, “Rapid identification of repeated patterns in strings, trees and arrays,” In Proc. 4th Annu. ACM Symp. Theory of Computing, 125-136, 1972; and A. Milosavljevic and J. Jurka, “Discovering Simple DNA sequences by the Algorithmic Significance Method,” Comput. Appl. Biosci., 9, 407-411, 1993. The constraint of exact repetition was relaxed (see, e.g., G. Landau and J. Schmidt, “An algorithm for approximate tandem repeats,” Z. Galil, A. Apostolico, M. Crochemore and U. Manber (editors), Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, 120-133, Springer-Verlag, 1993; S. K. Kannan and E. W. Myers, “An algorithm for locating non-overlapping regions with maximum alignment score,” Z. Galil, A. Apostolico, M. Crochemore and U. Manber (editors), Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, 74-86, Springer-Verlag, 1993; and G. Benson, “A space efficient algorithm for finding the best non-overlapping alignment score,” M. Crochemore and D. Gusfield (editors), Combinatorial Pattern Matching, volume 807 of Lecture Notes in Computer Science, 1-14, Springer-Verlag, 1994), where approximate repeats appearing exactly twice either contiguously (the G. Landau and J. Schmidt article) or in a non-overlapping fashion (the S. K. Kannan and E. W. Myers article; and the G. Benson article) were considered. An alternative strategy has also been followed. If the length T of the repeating unit is short enough that all words of that length (see, e.g., E. Rivals and O. Delgrange, “A first step towards chromosome analysis by compression algorithms,” N. G. Bourbakis (editor), First International IEEE Symposium on Intelligence in Neural and Biological Systems, 233-239, IEEE Computer Society Press, 1995) or all the possible ordered k elements in a window of size T (k less than T) (see, e.g., X. Guan and E. C. Uberbacher, “A Fast Look-up Algorithm for Detecting Repetitive DNA Sequences,” Pacific Symposium on Biocomputing, 718-719, 1996) can be generated and each used as a candidate for the repeating unit, then these candidates can be used as probes and “fitted” to the sequence under interrogation. The same approach can be followed if the repeating unit is given, see, e.g., V. Fischetti, G. Landau, J. Schmidt and P. Sellers, “Identifying periodic occurrences of a template with applications to protein structure,” Z. Galil, A. Apostolico, M. Crochemore and U. Manber (editors), Combinatorial Pattern Matching, volume 644 of Lecture Notes in Computer Science, 7486, Springer-Verlag, 1992. In the article by G. Benson and M. Waterman, “A method for fast database search for all k-nucleotide repeats,” Nucleic Acids Research, 22, 4828-4836, 1994, an algorithm that detects TR of a given size was proposed.
All of the above algorithms are practical only for a specific range of input parameters. For example, some of them are exhaustive algorithms (e.g., the S. K. Kannan and E. W. Myers algorithm; and the G. Benson algorithm), with time complexity O(n^2 polylog(n)), where n is the length of the full sequence, and thus n cannot be too large. This is a serious constraint if entire chromosomes are interrogated (for example, n is about 1,500,000 for chromosome 4 of S. cerevisiae). For other algorithms (e.g., the R. M. Karp, R. E. Miller and A. L. Rosenberg algorithm; and the A. Milosavljevic and J. Jurka algorithm), the constraint of exact repeats is untenable for biological sequences of DNA, where mutations, insertions and deletions constitute important ingredients. Tandem repeats appear in variable copy numbers, and therefore the algorithms that limit the search to only two repeats are too restrictive, see, e.g., the G. Landau and J. Schmidt algorithm. Moreover, the size of the repeating unit can reach a few hundred letters, and thus the algorithms that are practical only for short repeats (e.g., the E. Rivals and O. Delgrange algorithm; and the X. Guan and E. C. Uberbacher algorithm) will be useful only in a limited range of sizes. The assumption that the repeat (e.g., the V. Fischetti, G. Landau, J. Schmidt and P. Sellers article) or its length (e.g., the G. Benson and M. Waterman article) is given becomes a limitation when no a priori knowledge about the repeats is available.
More flexible algorithms that find TR of arbitrary period and copy number have recently been reported, see, e.g., G. Benson, “An Algorithm for Finding Tandem Repeats of Unspecified Pattern Size,” S. Istrail, P. Pevzner and M. Waterman (editors), Proc. 2nd Annual ACM Intl. Conference on Comp. Mol. Biol, RECOMB 98, 20-29, 1998; and M.-F. Sagot and E. Myers, “Identifying Satellites in Nucleic Acid Sequences,” Proc. 2nd Annual Intl Conference on Comp. Mol. Biol, RECOMB 98, 234-242, 1998. In the M.-F. Sagot and E. Myers approach, the algorithm exhaustively searches for all possible consensus models whose edit distance to the real repeated units is no bigger than an upper bound; it works by extending prefix models. Benson's algorithm (1998) works by finding k-tuples of consecutive characters from the sequence with identical contents (matching k-tuples) at a common distance. A region containing a sufficiently large number of matching k-tuples is accepted as a TR region if it passes a number of statistical tests based on a probabilistic model.
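The notion of matching k-tuples at a common distance can be illustrated with a short Python sketch. This is not Benson's algorithm: it omits the statistical tests and the probabilistic model entirely and uses a brute-force scan. The function name matching_ktuple_counts and the max_distance parameter are assumptions introduced for illustration.

```python
from collections import Counter


def matching_ktuple_counts(seq: str, k: int, max_distance: int) -> Counter:
    """Count, for each distance d, the positions i where the k-tuple starting
    at i is identical to the k-tuple starting at i + d.

    Brute-force illustration only; a distance with many matches hints at a
    repeat whose period equals that distance (or divides it).
    """
    counts = Counter()
    for d in range(1, max_distance + 1):
        # Last valid i keeps the shifted k-tuple inside the sequence.
        for i in range(len(seq) - k - d + 1):
            if seq[i:i + k] == seq[i + d:i + d + k]:
                counts[d] += 1
    return counts
```

On the sequence ACGACGACGACG with k = 3, the distance 3 collects the most matches, and its multiple 6 collects fewer, which is consistent with a repeat of period 3.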
A cursory comparison of the results reported in the G. Benson article (1998) and the M.-F. Sagot and E. Myers article shows that the two algorithms differ in sensitivity, in that the numbers of TR reported in the two works for the same chromosomes are substantially different. For example, for yeast chromosome 8, no TR model of length greater than 6 was reported in the M.-F. Sagot and E. Myers article, whereas 16 TR of lengths between 10 and 45 were reported in the G. Benson (1998) article. Only TR of size 10 or more were considered in the G. Benson (1998) article, whereas the repeat sizes covered in the M.-F. Sagot and E. Myers article ranged between 2 and 45. It seems that the heuristic algorithm of Benson is very sensitive, being able to detect TR in a wide range of sizes and copy numbers.
However, there is a need for a TR detection algorithm that provides for a considerable improvement in sensitivity, and which eliminates or at least substantially reduces other drawbacks associated with existing approaches.
The present invention provides methods and apparatus for detecting repetitive strings in input sequences of letters, characters, or the like. In one aspect of the invention, a computer-implemented method for identifying at least one region of interest in a one-dimensional input sequence comprises the following steps. First, patterns contained in said input sequence are identified. The patterns are then grouped into sets, wherein each pattern in a given set shares a period, where by period we mean the distance between patterns. Then, for at least one set of patterns, the following substeps are performed: (a) for each position P in the input sequence that lies in at least one pattern within the one set of patterns, a data value is generated representing the number of patterns in the set of patterns whose extremes (i.e., first and last characters or letters) flank said position P; (b) a low-pass transformation is applied to the data values for the one set of patterns; and (c) subsequent to the transformation of the data values, at least one pair of the transformed data values that satisfies a predetermined criterion is identified. Then, the method provides for identifying the region of interest based upon the positions in the input sequence corresponding to the pair of data values.
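The substeps above can be sketched in Python under several stated assumptions. Each pattern of one same-period set is represented here, hypothetically, as an inclusive (start, end) pair of positions; the low-pass transformation is taken to be a simple moving average; and the predetermined criterion is taken to be a pair of positions bounding a maximal run of smoothed values at or above a threshold. None of these choices is prescribed by the text, positions covered by no pattern are simply given the value 0, and all function names are illustrative.

```python
def coverage_profile(patterns, length):
    """For each position, count the patterns whose extremes flank it.

    patterns: list of inclusive (start, end) pairs (assumed representation).
    """
    cover = [0] * length
    for start, end in patterns:
        for p in range(start, end + 1):
            cover[p] += 1
    return cover


def moving_average(values, window):
    """Assumed low-pass transformation: centered moving average,
    with the window truncated at the ends of the sequence."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def candidate_regions(smoothed, threshold):
    """Assumed criterion: return (first, last) position pairs that bound
    maximal runs of smoothed values greater than or equal to threshold."""
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(smoothed) - 1))
    return regions
```

With three overlapping patterns (2, 6), (3, 7) and (4, 8) in a sequence of length 12, a window of 3 and a threshold of 2.0, the sketch reports a single candidate region spanning positions 3 through 7.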
The input sequence may comprise one or more ordered sequences of characters and, in one embodiment, each character of the one or more ordered sequences may be a nucleotide. Further, at least one data value between the pair of data values preferably exceeds a predetermined minimum value. The method may further comprise the step of, upon verifying that the region of interest is a repeating unit, reporting said region of interest to a user or an application.
In another aspect of the invention, a computer-implemented method for identifying regions of interest in a one-dimensional input sequence comprises the following steps. First, a set of repeating units contained in the input sequence is identified, wherein each given repeating unit satisfies at least the following conditions: (a) a first measure of similarity between adjacent repeating units in the set is greater than a first user-defined threshold, and (b) the given repeating unit includes at least one unit having a second measure of similarity with any other unit in the set that is greater than a second user-defined threshold. Then, the method provides for reporting positions in the input sequence that are covered by the set of repeating units.
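One possible reading of conditions (a) and (b) is sketched below in Python. The similarity measure (fraction of identical positions between equal-length units) is an assumption, since the text does not fix either measure, and condition (b) is interpreted here as requiring that the set contain at least one unit whose similarity to every other unit exceeds the second threshold. Other readings are possible. The names identity, satisfies_conditions, t_adjacent and t_anchor are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Assumed similarity measure: fraction of positions at which two
    equal-length units carry the same letter."""
    if len(a) != len(b) or not a:
        raise ValueError("units must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def satisfies_conditions(units, t_adjacent, t_anchor):
    """Check one reading of conditions (a) and (b) on an ordered list of units.

    (a) every pair of adjacent units has similarity above t_adjacent;
    (b) at least one unit has similarity above t_anchor with every other unit.
    """
    adjacent_ok = all(
        identity(u, v) > t_adjacent for u, v in zip(units, units[1:])
    )
    anchor_ok = any(
        all(identity(u, v) > t_anchor for j, v in enumerate(units) if j != i)
        for i, u in enumerate(units)
    )
    return adjacent_ok and anchor_ok
```

For the units ACGT, ACGA and ACCA, adjacent units agree at three of four positions, and the middle unit agrees with each of the others at three of four positions, so both conditions hold for thresholds of 0.7 but fail if the second threshold is raised to 0.8.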
While the inventive algorithm described herein can be used for any sequence of letters, characters, or the like, we shall at times concentrate on its applications to genomics. However, it is to be appreciated that the methodologies of the present invention are not limited to any specific application.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.