Pattern or motif discovery in data is widely used as a means of understanding large volumes of data such as DeoxyriboNucleic Acid (DNA) or protein sequences. There are a variety of currently existing pattern discovery techniques. Many of these techniques discover “rigid” motifs. Some of these techniques have been extended to “flexible” motifs.
A “rigid” motif is a repeating pattern that has the same length in every occurrence in an input sequence of data. The pattern contained in a rigid motif can contain “don't care” characters, which are generally symbolized by dots. A don't care character means that any character can occupy that location. For example, given a string s=abcdaXcdabbcd, the rigid motif m=a·cd occurs twice in the data, at positions 1 and 5 in s. A “flexible” motif is a repeating pattern that has a variable number of don't care characters. For instance, in the previous example, a flexible motif occurs three times, at positions 1, 5 and 9. At position 9, there are two dot characters, representing two gaps instead of one. This flexible motif may be written as m=a·[1,2]cd, where the [1,2] indicates that one or two don't care characters are allowed.
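The example above can be checked with a short sketch. This is only an illustration of what an occurrence of a rigid or flexible motif means, not a discovery algorithm; the function names, the use of an ASCII "." for the don't care character, and the split of the flexible motif into a prefix, a gap range, and a suffix are all assumptions made for the illustration.

```python
def rigid_occurrences(s, motif, dont_care="."):
    """Return 1-based positions where a rigid motif occurs in s.

    Every occurrence has the same length as the motif; a don't care
    character in the motif matches any character of s.
    """
    hits = []
    for start in range(len(s) - len(motif) + 1):
        window = s[start:start + len(motif)]
        if all(m == dont_care or m == c for m, c in zip(motif, window)):
            hits.append(start + 1)
    return hits


def flexible_occurrences(s, prefix, gap_range, suffix):
    """Return 1-based positions where prefix, a gap, then suffix occur.

    gap_range = (lo, hi) is the allowed number of don't care characters,
    so ("a", (1, 2), "cd") stands for the flexible motif a.[1,2]cd.
    """
    lo, hi = gap_range
    hits = []
    for start in range(len(s)):
        if not s.startswith(prefix, start):
            continue
        after_prefix = start + len(prefix)
        # Accept the position if any allowed gap length leads to the suffix.
        if any(s.startswith(suffix, after_prefix + gap) for gap in range(lo, hi + 1)):
            hits.append(start + 1)
    return hits


s = "abcdaXcdabbcd"
print(rigid_occurrences(s, "a.cd"))                # [1, 5]
print(flexible_occurrences(s, "a", (1, 2), "cd"))  # [1, 5, 9]
```

The occurrence at position 9 is found only by the flexible variant, because it needs two don't care characters between a and cd.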
Allowing motifs to have a variable number of don't care characters increases the number of discovered motifs but also increases discovery time and algorithm complexity.
Typically, the higher the number of repeating patterns in a sequence, the higher the number of motifs in the data. Motif discovery on highly repetitive data, such as DNA or protein sequences, is a source of concern because these data exhibit a very high degree of self-similarity (i.e., repeating patterns). The number of rigid motifs could potentially be exponential in the size of the input sequence and, in the case where the input is a sequence of real numbers, there could be an infinite number of motifs (assuming two real numbers are equal if they are within some δ of each other).
Usually, this problem of a large number of motifs is tackled by pre-processing the input, using heuristics, to remove the repeating or self-similar portions of the input, or by using a “statistical significance” measure. These types of models, therefore, reduce the number of motifs to a more manageable level. However, due to the absence of a good understanding of the domain, there is no consensus over the right model to use. In other words, if the domain is DNA, there may be insufficient information to know whether a statistical significance measure is correct. Consequently, important motifs may be discarded because of the particular statistical significance measure being used. Thus, there is a trend in different fields towards motif discovery that does not use models.
Empirical evidence suggests that the run-time for “model-less” motif discovery is linear in the output size for rigid motifs. However, none of the currently known algorithms has a proven output-sensitive complexity bound, and the only known complexity bounds are all exponential in the input size n. In other words, the known bounds for current pattern discovery algorithms depend on the size of the input, regardless of the size of the output of discovered patterns. This is important because two inputs of the same size may produce very different numbers of discovered patterns. With current techniques, pattern discovery for both of these inputs will take approximately the same amount of time.
In order to apply motif discovery techniques to real life situations, one has to deal with the fact that, in many applications, the input is known with a margin of error. Many amino acids in protein sequences, for instance, are easily interchanged by evolution without loss of function. Also, the use of distance matrices in the context of DNA sequences is common. For example, a character a can be viewed as a or b for pattern detection purposes, but ab cannot be viewed as an a. In all these situations, it is possible to view the input as a string of sets of characters instead of just characters. For instance, a sequence of the form baccta can be viewed as b{a,b}cct{a,b}. In some other applications, the input is an array of real numbers, and two distinct real numbers are deemed identical for pattern detection purposes if they are within some given δ>0 of each other. Conventional motif discovery algorithms deal with these situations in an ad hoc manner, with no uniform framework, such that the same algorithm cannot tackle all the scenarios described above.
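The two input models described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a uniform framework: the helper names (`as_sets`, `set_occurrences`, `within_delta`), the dictionary of interchangeable characters, the ASCII "." for the don't care character, and the choice to treat a difference of exactly δ as a match are all hypothetical.

```python
def as_sets(s, interchangeable):
    """View a string as a string of sets of characters.

    interchangeable maps a character to the set of characters it may be
    viewed as; characters without an entry stand only for themselves.
    """
    return [frozenset(interchangeable.get(c, {c})) for c in s]


def set_occurrences(set_string, motif, dont_care="."):
    """Return 1-based positions where a rigid motif occurs in a string of sets.

    A motif character matches a position if it belongs to the set there.
    """
    hits = []
    for start in range(len(set_string) - len(motif) + 1):
        if all(m == dont_care or m in set_string[start + i]
               for i, m in enumerate(motif)):
            hits.append(start + 1)
    return hits


def within_delta(x, y, delta):
    """Treat two real numbers as identical if they differ by at most delta."""
    return abs(x - y) <= delta


# baccta viewed as b{a,b}cct{a,b}: every a may be read as a or b.
sets = as_sets("baccta", {"a": {"a", "b"}})
print(set_occurrences(sets, "bc"))   # [2]
print(set_occurrences(sets, "ac"))   # [2]
print(within_delta(1.0, 1.05, 0.1))  # True
print(within_delta(1.0, 1.2, 0.1))   # False
```

Here the motif bc has no occurrence in the plain string baccta, but it occurs at position 2 once the a there is viewed as the set {a,b}.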
Thus, what is needed are techniques that overcome the following problems: (1) the problem of flexible and rigid pattern discovery within reasonable complexity and time; (2) the problem of solely input-sensitive complexity; and (3) the problem of the non-uniform framework for real numbers and strings of sets of characters during pattern discovery.