1. Field of the Invention
The present invention generally relates to data sequence processing methodologies, and more particularly, to methods and apparatus for detecting consensus motifs in sequences of data such as, for example, sequences of characters, character sets and real numbers.
2. Background Art
Given an input sequence of data, a “motif” is a repeating pattern, possibly interspersed with don't-care characters, that occurs in the sequence. The data could be characters, sets of characters, or real values. In the first two cases, the number of motifs could be exponential in the size of the input sequence, and in the third case there could be an uncountably infinite number of motifs. Typically, the higher the self-similarity in the sequence, the greater the number of motifs in the data. Motif discovery on data such as repeating DNA or protein sequences is a source of concern because such data exhibits a very high degree of self-similarity (repeating patterns).
At the same time, the problem of detecting common motifs across DNA sequences for locating regulatory sites, transcription binding factors or even drug target binding sites is of prime importance. The main difficulty is that these motifs have subtle variations at each occurrence. This problem has been of interest to both biologists and computer scientists. A satisfactory practical solution has been elusive although the problem is defined very precisely:
Problem 1 (The Consensus Motif Problem): Given t sequences si on some alphabet Σ, a length l>0 and a distance d≧0, the task is to find all patterns p of length l that occur in each si such that each occurrence p′i on si has at most d mismatches with p.
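As an illustration of the definition in Problem 1, the following minimal sketch tests whether a given candidate pattern p qualifies as a consensus motif. The function names (hamming, occurs_within, is_consensus_motif) are hypothetical and chosen only for this example; the sketch checks a single candidate and does not search for candidates.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(p, s, d):
    """True if some length-len(p) window of s has at most d mismatches with p."""
    l = len(p)
    return any(hamming(p, s[i:i + l]) <= d for i in range(len(s) - l + 1))


def is_consensus_motif(p, seqs, d):
    """True if p occurs, with at most d mismatches, in every sequence."""
    return all(occurs_within(p, s, d) for s in seqs)


if __name__ == "__main__":
    seqs = ["ACGTACGT", "TTACGAAA", "GGGACTTG"]
    print(is_consensus_motif("ACGT", seqs, 1))
    print(is_consensus_motif("ACGT", seqs, 0))
```

In the toy input, the pattern ACGT appears exactly in the first sequence and with one mismatch in the other two (as ACGA and ACTT), so it qualifies for d=1 but not for d=0.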
The problem in this form made its first appearance in 1984 (M. S. Waterman, R. Arratia, and D. J. Galas, “Pattern recognition in several sequences: Consensus and alignment”, Bulletin of Mathematical Biology, 46(4):515-527, 1984). In this discussion, the alphabet Σ is {A, C, G, T}, and the problem is made difficult by the fact that each occurrence of the pattern p may differ from p in up to d positions, so that the consensus pattern p itself may not occur exactly (with zero mismatches) in any of the sequences. In the above-mentioned paper, Waterman and coauthors provide exact solutions to this problem by enumerating neighborhood patterns, i.e., patterns that are within Hamming distance d of a candidate pattern. Sagot gives a good summary of the (computational) efforts in M. F. Sagot, “Spelling approximate repeated or common motifs using a suffix tree”, LATIN'98: Theoretical Informatics, Lecture Notes in Computer Science, 1380:111-127, 1998, and offers a solution that improves the time complexity of the earlier algorithms by the use of generalized suffix trees. These clever enumeration schemes, though exact, have the drawback that they run in time exponential in the pattern length.
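The idea of enumerating neighborhood patterns can be sketched as follows. This is a naive illustration under stated assumptions, not the algorithm of Waterman and coauthors or the suffix-tree method of Sagot: for every window of length l in every sequence it generates all patterns within Hamming distance d, and then intersects the resulting sets across the sequences. The names neighborhood and consensus_motifs are hypothetical.

```python
from itertools import combinations, product


def neighborhood(x, d, alphabet="ACGT"):
    """All strings within Hamming distance d of x over the given alphabet."""
    result = set()
    for k in range(d + 1):
        # Choose exactly k positions to change, then every substitution there.
        for positions in combinations(range(len(x)), k):
            choices = [[c for c in alphabet if c != x[i]] for i in positions]
            for replacement in product(*choices):
                y = list(x)
                for i, c in zip(positions, replacement):
                    y[i] = c
                result.add("".join(y))
    return result


def consensus_motifs(seqs, l, d, alphabet="ACGT"):
    """Patterns of length l lying within distance d of a window in every sequence."""
    common = None
    for s in seqs:
        near = set()
        for i in range(len(s) - l + 1):
            near |= neighborhood(s[i:i + l], d, alphabet)
        common = near if common is None else common & near
    return common or set()


if __name__ == "__main__":
    seqs = ["ACGTACGT", "TTACGAAA", "GGGACTTG"]
    print(sorted(consensus_motifs(seqs, 4, 1)))
```

The neighborhood of a single pattern of length l contains the sum over k from 0 to d of C(l, k)·(|Σ|−1)^k patterns, which is 13 for l=4, d=1 and 123,841 for l=15, d=4 on the DNA alphabet. This growth illustrates why such enumeration is practical only for short patterns.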
This problem of detecting common subtle patterns across sequences is nevertheless of great interest, and various statistical and machine learning approaches, which are inexact but more efficient, have been proposed. One question that can be asked to compare and test the efficacy of such consensus motif detection systems is: Given a set of sequences that harbor (with mutations) k motifs, what percentage of the k motifs does the system recover? When k is large, many approaches are known that give good average-case performance under this criterion.
Yet another question to ask is: Given a set of sequences that harbor (with mutations) ONE motif p, does the system recover p? This is a rather difficult criterion to meet since these algorithms use some form of local search based on Gibbs sampling or expectation maximization or even clever heuristics. Hence it is not surprising that they may miss p. However, a question of this form is a biological reality. Consider the following, somewhat contrived, variation of Problem 1 which is an attempt at simplifying the computational problem.
Problem 2 (The Planted (l, d)-motif problem): Given t sequences s′i on Σ, a pattern p of length l is embedded in each s′i, with exactly d errors (mutations), to obtain the sequence si of length n, for each 1≦i≦t. The task is to recover p, given si, 1≦i≦t, and the two numbers l and d.
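For illustration, an instance of Problem 2 can be generated as in the following minimal sketch. It assumes uniformly random background sequences and assumes that the mutated copy of p overwrites a randomly chosen window, so that each resulting sequence has length n; neither assumption is stated in the problem definition. The function name plant_instance and the seed parameter are hypothetical.

```python
import random


def plant_instance(n, t, l, d, alphabet="ACGT", seed=None):
    """Return (p, seqs, positions) for a planted (l, d)-motif instance.

    Assumption: backgrounds are uniform random, and each mutated copy of p
    overwrites a random window so that every sequence has length n.
    """
    rng = random.Random(seed)
    p = "".join(rng.choice(alphabet) for _ in range(l))
    seqs, positions = [], []
    for _ in range(t):
        background = [rng.choice(alphabet) for _ in range(n)]
        variant = list(p)
        # Mutate exactly d distinct positions to a different character.
        for i in rng.sample(range(l), d):
            variant[i] = rng.choice([c for c in alphabet if c != p[i]])
        pos = rng.randrange(n - l + 1)
        background[pos:pos + l] = variant
        seqs.append("".join(background))
        positions.append(pos)
    return p, seqs, positions


if __name__ == "__main__":
    p, seqs, positions = plant_instance(n=60, t=5, l=8, d=2, seed=0)
    for s, pos in zip(seqs, positions):
        print(pos, s[pos:pos + len(p)], "planted copy of", p)
```

The returned positions are retained only so that a recovery method can be scored against the known answer; a motif finder would be given the sequences together with l and d alone.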
Pevzner and Sze set forth the challenge problem, which was Problem 2 with parameters n=600, t=20, l=15 and d=4 (P. A. Pevzner and S.-H. Sze, “Combinatorial approaches to finding subtle signals in DNA sequences”, In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, Pages 269-278, AAAI Press, 2000). There is also a need for the deployment of combinatorial approaches to tackle this problem. One of the algorithms they presented was an exact algorithm, in which the challenge problem was reduced to finding a t-sized clique in a t-partite graph with at most n−l+1 vertices in each partition. Even the best-known heuristics for the clique-finding problem failed to detect the clique corresponding to the signal. The second algorithm was based on enumerating possible patterns and checking their candidacy for being the subtle pattern using clever heuristics and an exhaustive search in a reduced space.
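The reduction to clique finding can be sketched as follows. Each l-mer of sequence i is a vertex in partition i. The edge rule used here, joining two l-mers from different sequences when their Hamming distance is at most 2d, is an assumption not stated in the text above; it follows from the observation that two copies of p with d mutations each can differ in at most 2d positions. The backtracking search is a naive placeholder suitable only for toy inputs, and the names build_graph and find_t_clique are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(seqs, l, d):
    """Vertex (i, j) is the l-mer starting at offset j of sequence i.

    Assumed edge rule: l-mers from different sequences are adjacent
    when their Hamming distance is at most 2*d.
    """
    lmers = {(i, j): s[j:j + l]
             for i, s in enumerate(seqs)
             for j in range(len(s) - l + 1)}
    adj = {v: set() for v in lmers}
    verts = list(lmers)
    for a in range(len(verts)):
        for b in range(a + 1, len(verts)):
            u, v = verts[a], verts[b]
            if u[0] != v[0] and hamming(lmers[u], lmers[v]) <= 2 * d:
                adj[u].add(v)
                adj[v].add(u)
    return lmers, adj


def find_t_clique(adj, t):
    """Naive backtracking: pick one vertex per partition, all pairwise adjacent."""
    by_part = [[v for v in adj if v[0] == i] for i in range(t)]

    def extend(clique):
        if len(clique) == t:
            return list(clique)
        for v in by_part[len(clique)]:
            if all(v in adj[u] for u in clique):
                found = extend(clique + [v])
                if found is not None:
                    return found
        return None

    return extend([])


if __name__ == "__main__":
    seqs = ["ACGTTT", "GGACGA", "CACTTC"]
    lmers, adj = build_graph(seqs, l=4, d=1)
    clique = find_t_clique(adj, t=len(seqs))
    print([(v, lmers[v]) for v in clique] if clique else "no clique")
```

A t-sized clique is a necessary condition only: background l-mers can also be pairwise close by chance, so a clique found this way is a candidate for the signal rather than a guaranteed match.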
One effective algorithm is the one discussed by Buhler and Tompa, “Finding motifs using random projections”, In Proceedings of the Annual Conference on Computational Molecular Biology, (RECOMB01), Pages 69-75, ACM Press, 2001. This probabilistic algorithm uses a random projection h and hashes each input l-mer x into bucket h(x). Any hash bucket with sufficiently many entries is explored as a potential embedded motif. This approach solved the challenge problem and some further instances.
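The hashing step described above can be sketched as follows. The sketch assumes that the projection h selects k of the l positions uniformly at random and uses the characters at those positions as the bucket key; the parameter names k and threshold, and the function name project_buckets, are hypothetical. Any subsequent exploration of a heavy bucket to derive a motif is omitted.

```python
import random
from collections import defaultdict


def project_buckets(seqs, l, k, threshold, seed=None):
    """Hash every l-mer by k randomly chosen positions.

    Returns the chosen positions and the buckets holding at least
    `threshold` l-mers, each recorded as (sequence index, offset).
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for i, s in enumerate(seqs):
        for j in range(len(s) - l + 1):
            key = "".join(s[j + q] for q in positions)
            buckets[key].append((i, j))
    heavy = {key: hits for key, hits in buckets.items()
             if len(hits) >= threshold}
    return positions, heavy


if __name__ == "__main__":
    seqs = ["TTACGTTT", "GGACGTGG", "CCACGTCC"]
    positions, heavy = project_buckets(seqs, l=4, k=2, threshold=3, seed=1)
    print("projected positions:", positions)
    for key, hits in heavy.items():
        print(key, hits)
```

Two mutated copies of the same motif fall into the same bucket whenever none of the k projected positions is a mutated position in either copy, which is why a single random projection may miss the motif and the procedure would ordinarily be repeated with fresh projections.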