The present invention generally relates to data sequence processing methodologies and, more particularly, to methods and apparatus for discovering and generating motifs with respect to sequences of data such as, for example, sequences of characters, character sets and real numbers.
Given an input sequence of data, a xe2x80x9cmotifxe2x80x9d is a repeating pattern, possibly interspersed with don""t-care characters, that occurs in the sequence. The data could be characters or sets of characters or real values. In the first two cases, the number of motifs could potentially be exponential in the size of the input sequence and in the third case there could be uncountably infinite number of motifs. Typically, the higher the self similarity in the sequence, the greater is the number of motifs in the data.
Motif discovery on such data, such as repeating DNA or protein sequences, is a source of concern since such data exhibits a very high degree of self-similarity (repeating patterns). Usually, this problem of an exploding number of motifs is tackled by pre-processing the input, using heuristics, to remove the repeating or self-similar portions of the input. Another way of trimming down the number of motifs is to use a xe2x80x9cstatistical significancexe2x80x9d measure. However, due to the absence of a good understanding of the domain, there is no consensus over the right model to use.
Thus, there is a trend towards model-less motif discovery in different fields. To keep the problem manageable, it is useful to identify a small number of motifs that capture important information about the family of motifs. However, no conventional method exists which is able to satisfactorily identify a small enough number of such motifs to provide applications interested in utilizing the motifs with manageable and practical results.
The present invention provides methods and apparatus for detecting motifs in sequences of data, such as, for example, sequences of characters, character sets and real numbers, such as to provide applications interested in utilizing the motifs with manageable and practical results. Particularly, the invention provides that, for any sequence, there exists only a linear number of special motifs and every other motif can be simply generated from them. We name these special motifs xe2x80x9cirredundant motifs.xe2x80x9d The result is meaningful also from an algorithmic viewpoint, since the ideas from the proof can be used to design a polynomial time algorithm to detect these irredundant motifs. This bound on the number of useful motifs gives validation to motif-based approaches, since the total number of irredundant motifs does not explode. This result is of significance to most applications that use pattern discovery as the basic engine such as data mining, clustering and matching. This family of irredundant motifs is also very characteristic of the family of all the motifs. For example, in applications such as multiple sequence alignment, we have shown that the irredundant motifs suffice to obtain the alignment. However, in applications that use the motifs to extract signature motifs of sets of sequences, all the motifs, including the redundant ones, may be of relevance.
In one illustrative aspect of the present invention, a method of detecting repeating patterns (i.e., motifs) in an input data sequence, wherein the data sequence includes elements from an element alphabet, comprises the steps of: (i) obtaining the input data sequence; (ii) constructing a set of patterns from the input data sequence, each pattern being unique and including one or more elements from the input data sequence, and each pattern having a list associated therewith representing the location of the pattern in the input data sequence; (iii) removing a pattern from the set when the location list of the pattern is a union of the location lists of at least two other patterns in the set; (iv) for each pair of compatible patterns in the set, constructing a new pattern which is a concatenation of the pair of compatible patterns, each new pattern having a location list associated therewith; and (v) storing the patterns, and associated location lists, remaining after the removing step and the new pattern constructing step as the detected repeating patterns. The one or more patterns may further include one or more don""t care positions, i.e., don""t care characters. As mentioned, the elements of the input data sequence may be in various forms, e.g., characters, character sets or real numbers.
In accordance with such a methodology, as will be explained below, no more than 3n repeating patterns are detected given an input data sequence which includes n elements. The no more than 3n repeating patterns are maximal and non-redundant, i.e., they are irredundant patterns or motifs. From these irredundant patterns, patterns that are non-maximal and/or redundant may be generated. In one embodiment, the non-maximal and/or redundant patterns are generated in accordance with one or more annotated tries.
In one exemplary application, as will be explained below, the input data sequence is a protein sequence and the inventive methodologies are used in accordance with protein sequence homology detection.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.