A rigid motif is a repeating pattern in a string of data comprising a plurality of tokens, such as alphabet characters, possibly interspersed with don't-care characters, that has the same length in every occurrence in the input sequence. Pattern or motif discovery is widely used for understanding large volumes of data such as DNA or protein sequences.
Allowing the motifs to have a variable number of gaps (or don't-care characters) further increases the expressiveness of the motifs; such motifs are called patterns with spacers or extensible motifs. For example, given a string s=abcdaXcdabbcd, m=a.cd is a rigid pattern that occurs twice in s, at positions 1 and 5. The corresponding extensible motif, in which the number of don't-care characters between a and c is one or two, occurs three times, at positions 1, 5 and 9. At position 9 the dot character represents two gaps instead of one.
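The occurrence counts in this example can be checked with a short sketch. It assumes, purely for illustration, that a motif can be written as a Python regular expression in which "." is a don't-care character and ".{1,2}" is a gap of one or two don't-care characters; the helper name occurrences is hypothetical and is not part of any discovery method described here. Note that this only matches a pattern that is already known; it does not discover one.

```python
import re

def occurrences(regex, s):
    """Return the 1-based start positions at which regex matches in s.

    Every start position is tried separately, so overlapping
    occurrences are reported as well.
    """
    pattern = re.compile(regex)
    return [i + 1 for i in range(len(s)) if pattern.match(s, i)]

s = "abcdaXcdabbcd"
rigid = r"a.cd"            # exactly one don't-care between a and c
extensible = r"a.{1,2}cd"  # one or two don't-cares between a and c

print(occurrences(rigid, s))       # [1, 5]
print(occurrences(extensible, s))  # [1, 5, 9]
```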
The task of discovering patterns must be clearly distinguished from that of matching a given pattern in a string of characters or database. In the latter situation, we know what we are looking for, while in the former we do not know what is being sought. Typically, the higher the self-similarity (i.e., repeating patterns) in the sequence, the greater the number of patterns or motifs in the data. Motif discovery on data such as repeating DNA or protein sequences is therefore a source of concern, because these sequences exhibit a very high degree of self-similarity. The number of rigid maximal motifs could potentially be exponential in the size of the input sequence.
The problem of a large number of motifs is usually tackled either by pre-processing the input with heuristics to remove its repeating or self-similar portions, or by using a statistical significance measure. However, due to the absence of a good understanding of the domain, there is no consensus over the right model to use. Thus there is a trend towards model-less motif discovery in different fields. For rigid motifs, there is empirical evidence that the run time is linear in the output size, and experimental comparisons between available implementations have been reported.
A rigid pattern discovery tool, even one that allows don't-care characters, is often inadequate for detecting biologically significant motifs. Consider the following example. Fibronectin is a plasma protein that binds cell surfaces and various compounds including collagen, fibrin, heparin, DNA, and actin. The major part of the sequence of fibronectin consists of the repetition of three types of domains, which are called type I, II, and III. The type II domain is approximately forty residues long, contains four conserved cysteines involved in disulfide bonds, and is part of the collagen binding region of fibronectin. In fibronectin the type II domain is duplicated. Type II domains have also been found in various other proteins. The fibronectin type II domain pattern has the following form:
C...PF.[FYWI].......C-(8,10)WC....[DNSR][FYW]-(3,5)[FYW].[FYWI]C
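The extensible parts of the pattern are shown as integer intervals: -(8,10) denotes a gap of eight to ten don't-care characters and -(3,5) a gap of three to five. Because the length of these gaps varies from one occurrence to another, a rigid pattern discovery tool will never capture this as a single domain. Therefore, there is a need for a system and method of pattern discovery that overcomes the drawbacks in the prior art.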
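To make the limitation concrete, the following sketch translates the interval notation of the pattern above into a Python regular expression and compares it with a rigid variant whose gaps are fixed at their lower bounds. The function names motif_to_regex and synthetic_domain are hypothetical, and the test strings are synthetic placeholders built to fit the pattern, not real fibronectin sequences.

```python
import re

FN2 = "C...PF.[FYWI].......C-(8,10)WC....[DNSR][FYW]-(3,5)[FYW].[FYWI]C"

def motif_to_regex(motif, rigid=False):
    """Translate '-(m,n)' gaps into regex quantifiers.

    With rigid=True each gap is fixed at its lower bound m, which is
    what a fixed-length (rigid) motif would have to commit to.
    """
    motif = motif.replace(" ", "")  # tolerate stray spaces in the pattern

    def gap(match):
        lo, hi = match.group(1), match.group(2)
        return ".{%s}" % lo if rigid else ".{%s,%s}" % (lo, hi)

    return re.sub(r"-\((\d+),(\d+)\)", gap, motif)

def synthetic_domain(gap1, gap2):
    """Build a made-up string that fits the pattern with the given gap lengths."""
    return ("C" + "AAA" + "PF" + "A" + "F" + "A" * 7 + "C" + "A" * gap1 +
            "WC" + "AAAA" + "D" + "F" + "A" * gap2 + "F" + "A" + "F" + "C")

extensible = re.compile(motif_to_regex(FN2))
rigid = re.compile(motif_to_regex(FN2, rigid=True))

short, long_ = synthetic_domain(8, 3), synthetic_domain(10, 5)
print(bool(extensible.fullmatch(short)), bool(extensible.fullmatch(long_)))  # True True
print(bool(rigid.fullmatch(short)), bool(rigid.fullmatch(long_)))            # True False
```

The rigid variant covers only occurrences with one particular combination of gap lengths, so occurrences with other gap lengths would have to be reported as separate motifs.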