1. Field of the Invention
The present invention generally relates to a method and system for extraction of extensible motifs. More particularly, the present invention relates to a method and system for extraction of extensible motifs using combinatorial and statistical pruning.
2. Description of the Related Art
The discovery of a motif in a biosequence is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the variety of motifs described by strings that include “don't care” patterns escalates exponentially with the length of the motif, and this only gets worse if a “don't care” is allowed to stretch up to some prescribed maximum length. This circumstance tends to generate daunting computational burdens, and often gives rise to tables that are impossible to visualize and digest. This is unfortunate, as it seems to preclude precisely those massive analyses that have become conceivable with the increasing availability of massive amounts of genomic and protein data. While part of the problem is endemic, another part of it seems rooted in the various characterizations offered for the notion of a motif, which are typically based either on syntax or on statistics alone.
The discovery of motifs in biosequences is attracting increasing interest because motifs are perceived to be implicated in many aspects of biological structure and function. The approaches to motif discovery may be partitioned into two main classes. In the first class, the sample string is tested for occurrences of motifs in a family of a priori defined, abstract models or templates. The second class of approaches assumes that the search may be limited to substrings in the sample or to some more or less controlled neighborhood of those substrings. The approaches in the first class are more rigorously justifiable, but often pose daunting computational burdens. Those in the second class tend to be computationally viable but rest on shakier methodological grounds.
The characterizations offered for the notion of a motif could be partitioned roughly into statistical and syntactic. In a typical statistical characterization, a motif is a sequence of m positions such that at each position each character from (some subset of) the alphabet may occur with a given probability or weight. This is often described by a suitable matrix or profile, where columns correspond to positions and rows to alphabet characters. The lineage of syntactic characterizations could be ascribed to the theory of error correcting codes: a motif is a pattern w of length m and an occurrence of it is any string at a distance of d, the distance being measured in terms of errors of a certain type. For example, we can have only substitutions in a Hamming variant, substitutions and indels in a Levenshtein variant, and so on. Syntactic characterizations enable us to describe the model of a motif, or a realization of it, or both, as a string or simple regular expression over an extension of the input alphabet Σ, e.g., over Σ∪{.}, where “.” denotes the “don't care” character.
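By way of illustration only, the syntactic characterization in its Hamming variant can be sketched in a few lines of Python. The function names (hamming, hamming_occurrences, matches_dont_care) are hypothetical and are not taken from the present disclosure, and the sketch assumes that an occurrence is any substring within distance at most d of the pattern w, which is one possible reading of "at a distance of d".

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(1 for x, y in zip(u, v) if x != y)


def hamming_occurrences(text, w, d):
    """Start positions of substrings of text within Hamming distance d of w.

    Assumption: "within distance d" means at most d substitutions.
    """
    m = len(w)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], w) <= d]


def matches_dont_care(pattern, s, dont_care="."):
    """True if s realizes a pattern over the alphabet extended with '.'.

    Each don't care position matches exactly one arbitrary character.
    """
    return len(pattern) == len(s) and all(
        p == dont_care or p == c for p, c in zip(pattern, s))
```

Under these assumptions, hamming_occurrences scans every window of length m, so its cost grows with the product of the text length and the pattern length; it is meant to make the definition concrete, not to be efficient.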
Irrespective of the particular model or representation chosen, the tenet of motif discovery equates over-representation of a motif with surprise and hence with interest. Thus, any motif discovery algorithm must ultimately weigh motifs against some threshold, based on a score that compares empirical and expected frequency, perhaps with some normalization. The departure of a pattern w from expectation is commonly measured by so-called z-scores, which have the form:
$$z(w) = \frac{f(w) - E(w)}{N(w)} \tag{1}$$
where:
f(w)>0 represents a frequency;
E(w)>0 represents an expectation; and
N(w)>0 is the expected value of some function of w.
For a given z-score function, set of patterns W, and real positive threshold T, patterns such that z(w)>T or z(w)<−T are respectively dubbed over- or under-represented, or simply surprising. The problem is that the number of patterns extracted in this way may escalate quite rapidly, a circumstance that seems to preclude precisely those massive analyses that have become conceivable with the increasing availability of whole genomes.
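As a minimal sketch of how such a score might be evaluated, the following Python fragment computes a z-score of the form (f(w) − E(w))/N(w) for a solid pattern w and labels it against a threshold T. The choices made here are assumptions for illustration and are not prescribed by the text above: the source is modeled as independent, identically distributed symbols with given probabilities, f(w) counts possibly overlapping occurrences, E(w) is the expected count under that model, and N(w) is a first-order approximation of the standard deviation. All function names are hypothetical.

```python
import math


def count_occurrences(text, w):
    """f(w): number of (possibly overlapping) occurrences of w in text."""
    m = len(w)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == w)


def z_score(text, w, probs):
    """z(w) = (f(w) - E(w)) / N(w) under an assumed i.i.d. symbol model.

    probs maps each alphabet character to its assumed probability.
    """
    p_hat = math.prod(probs[c] for c in w)      # probability of w at one position
    e = (len(text) - len(w) + 1) * p_hat        # E(w): expected count
    n = math.sqrt(e * (1.0 - p_hat))            # N(w): assumed normalization
    if e <= 0 or n <= 0:
        raise ValueError("E(w) and N(w) must be positive")
    return (count_occurrences(text, w) - e) / n


def classify(z, threshold):
    """Label a pattern as over-represented, under-represented, or neither."""
    if z > threshold:
        return "over"
    if z < -threshold:
        return "under"
    return "unsurprising"
```

Applying classify to every candidate in a set W makes the escalation visible: the number of scores to compute and tabulate is the number of candidate patterns, which is what grows rapidly once don't cares are admitted.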
Large-scale statistical tables may not only impose an unbearable computational burden; they are also impractical to visualize and use, a circumstance that may defeat the purpose of building them in the first place.
A little reflection establishes how an exponential build-up may take place. Assume that on the binary alphabet both aabaab and abbabb are asserted as realizations of a candidate interesting motif. One may then describe this motif concisely as a.ba.b, with “.” denoting the don't care, and look for further occurrences of it. By this, however, the spurious patterns aababb and abbaab are also annexed.
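The binary-alphabet example can be checked mechanically. The short Python sketch below generalizes the two asserted strings column by column and then enumerates every string that the resulting pattern admits. The helper names generalize and expand are hypothetical, and the column-wise generalization rule is an assumption chosen to reproduce the pattern a.ba.b.

```python
from itertools import product


def generalize(seeds, dont_care="."):
    """Keep a character where all seeds agree; otherwise use a don't care."""
    return "".join(col[0] if len(set(col)) == 1 else dont_care
                   for col in zip(*seeds))


def expand(pattern, alphabet, dont_care="."):
    """All strings over alphabet that realize the given pattern."""
    slots = [alphabet if c == dont_care else c for c in pattern]
    return {"".join(chars) for chars in product(*slots)}


seeds = {"aabaab", "abbabb"}
pattern = generalize(sorted(seeds))      # "a.ba.b"
admitted = expand(pattern, "ab")         # four strings in total
spurious = admitted - seeds              # {"aababb", "abbaab"}
```

Each additional don't care multiplies the number of admitted strings by the alphabet size, so a pattern with k don't cares over an alphabet of size s admits s^k strings, of which only the asserted ones were actually observed.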
A similar problem presents itself in the approaches that resort to the profiles or the weighted matrices previously mentioned. Even setting aside computational aspects, tables that are too large at the outset run the risk of saturating the visual bandwidth of a user. In this spirit, approaches that limit the number of patterns to be considered from the start may provide a more significant throughput, even in comparison with exhaustive methods.