1. Technical Field
The invention relates to the field of pattern discovery, and, more specifically, to pattern discovery in 1-dimensional event streams.
2. Description of the Related Art
An event stream is a sequence of events which are taken from a finite set of possible events. This latter set can be thought of as an alphabet in which case an event stream is a string over that alphabet. The term sequence used below may refer to an event stream or a sequence of characters belonging to an alphabet. A pattern is a specific set of letters with a given spatial arrangement, typically described as a regular expression.
An example of such a pattern is "AF..H..RR" where the dots are used to indicate that the respective positions could be occupied by "any" letter ("don't care" character). An event string is said to match the pattern at a given position i, if and only if the letters of the pattern all match the corresponding letters of the event string, when placed at offset i; a don't care character is assumed to match any letter of the alphabet. For example, "AF..H..RR" matches "HWIRTAFLKHAARRIKWL" at position 6 (counting positions from 1).
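By way of illustration only, the matching rule just described might be sketched in Python as follows. The function names matches_at and match_positions are hypothetical and are not part of the original disclosure; positions are counted from 1, as in the example above.

```python
def matches_at(pattern: str, text: str, pos: int, wildcard: str = ".") -> bool:
    """Return True if `pattern` matches `text` at 1-based position `pos`."""
    start = pos - 1  # convert to a 0-based index
    if start < 0 or start + len(pattern) > len(text):
        return False  # the pattern would run off either end of the text
    window = text[start:start + len(pattern)]
    # A wildcard matches any letter; every other pattern letter must match exactly.
    return all(p == wildcard or p == t for p, t in zip(pattern, window))


def match_positions(pattern: str, text: str) -> list:
    """List every 1-based position at which `pattern` matches `text`."""
    return [i for i in range(1, len(text) - len(pattern) + 2)
            if matches_at(pattern, text, i)]


print(match_positions("AF..H..RR", "HWIRTAFLKHAARRIKWL"))  # [6]
```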
The problem of pattern discovery is computationally a very demanding one. Indeed, it can be proven to be NP-hard (unless the type of patterns sought is extremely simple). The problem can be stated as follows:
"Given a set S={s1, s2, . . . , sm} of one or more sequences si (i.e. strings) over an alphabet Σ of letters and a positive integer K, find all the patterns which match K or more of the input sequences in S."
In this first formulation, what is sought is those patterns that appear in at least K of the sequences of the input set. However, it may happen that a pattern appears in fewer than K of the sequences, but more than once in some of those sequences. In other words, one or more sequences may contain multiple occurrences of a given pattern. Consequently, such a pattern may appear in fewer than K sequences but more than K times when all sequences of S are considered. The definition of the problem can then be modified to capture this case as well:
"Given a set S={s1, s2, . . . , sm} of one or more sequences si over an alphabet Σ of letters and a positive integer K, find all the patterns which appear K or more times in the sequences in S."
The invention that is described below can handle both of these versions of the pattern discovery problem.
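The difference between the two formulations can be made concrete with a short sketch. The following Python fragment is illustrative only: the helper names and the toy input are assumptions, and only rigid patterns with '.' as the don't care character are handled. It counts, for a given pattern, both the number of sequences that contain it and the total number of (possibly overlapping) occurrences.

```python
import re


def occurrences(pattern, sequence):
    """0-based start offsets of all (possibly overlapping) matches of a rigid pattern."""
    body = "".join("." if c == "." else re.escape(c) for c in pattern)
    # A zero-width lookahead lets overlapping occurrences be reported separately.
    return [m.start() for m in re.finditer("(?=" + body + ")", sequence)]


def sequence_support(pattern, sequences):
    """First formulation: number of sequences containing the pattern."""
    return sum(1 for s in sequences if occurrences(pattern, s))


def occurrence_support(pattern, sequences):
    """Second formulation: total number of occurrences over all sequences."""
    return sum(len(occurrences(pattern, s)) for s in sequences)


S = ["ABABA", "CACA", "BBB"]  # hypothetical toy input
print(sequence_support("A.A", S), occurrence_support("A.A", S))  # 2 3
```

With K=3, the pattern "A.A" would be reported under the second formulation but not under the first.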
The problem of discovering patterns in event streams appears very often and in many different application areas (biology, data mining in databases, computer security etc.). Depending on the particular domain at hand there are many different definitions of the notion of the pattern as well as of what it means for an input to match a particular pattern.
Almost invariably the input is a set S composed of arbitrary strings over a given alphabet Σ and a pattern is any member of a well defined subset C of all the possible regular expressions over Σ (e.g. C can be the set of all regular expressions with some predefined maximum length). The reasons why the search for patterns is restricted to some subset of all possible regular expressions are both domain-specific and also of a practical nature since (as is explained later) the problem is computationally extremely demanding.
Being a regular expression, every pattern P defines a language L(P) in the natural way: a string belongs to L(P) if it is recognized by the automaton of P. An input string w is said to match a given pattern P if w contains some substring that belongs to L(P). The problem then is to discover all patterns in C matched by at least a given (user-defined) minimum number of strings in the input set S.
For an illustration, consider the following set of strings over the English alphabet:
S={"LARGE", "LINGER", "AGE"}
In this case the pattern "L..GE" has support 2 since it is matched by the first two strings of S (the special character '.' in the pattern indicates a position that can be occupied by any character). The term support is used here to denote the number of input strings matching a given pattern. As another example, the pattern "A*GE" also has support 2 (it is matched by the first and the last strings). Here, the character '*' is used to match substrings of arbitrary length, which can include strings of zero length.
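The support values quoted above can be checked with a small sketch. The Python code below is illustrative only; it assumes that '.' and '*' are the only special characters and translates a pattern into an expression for the standard re module. The helper names to_regex and support are hypothetical.

```python
import re


def to_regex(pattern):
    """Translate a pattern using '.' (any one character) and '*' (any substring)."""
    pieces = []
    for ch in pattern:
        if ch == ".":
            pieces.append(".")
        elif ch == "*":
            pieces.append(".*")  # arbitrary length, including zero
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces))


def support(pattern, strings):
    """Number of input strings containing a substring that matches the pattern."""
    rx = to_regex(pattern)
    return sum(1 for s in strings if rx.search(s))


S = ["LARGE", "LINGER", "AGE"]
print(support("L..GE", S), support("A*GE", S))  # 2 2
```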
The computational complexity of the pattern discovery problem directly depends on the class C of regular expressions that the patterns belong to. In the most straightforward case a pattern is simply a string over the alphabet Σ (i.e. just a sequence of alphabet symbols, with no don't care characters) and the problem of finding all such patterns with a given minimum support can be solved in polynomial time using generalized suffix trees. An example of such a pattern discovery problem is illustrated in Hui, "Color Set Size Problem with Applications to String Matching", Proceedings of the 2nd Symposium on Combinatorial Pattern Matching, 1992, pp. 230-243.
In almost every other case though, the class C is expressive enough to render the problem NP-hard. The hardness result can usually be shown by a reduction from the longest common subsequence problem, discussed in Garey and Johnson, "Computers and Intractability: a Guide to the Theory of NP-Completeness", 1979, and Maier, "The Complexity of Some Problems on Subsequences and Supersequences", Journal of the ACM, 1978, pp. 322-336. What this means in practice is that, unless P = NP, there can be no algorithm guaranteed to produce, for every possible input, all the required patterns in time polynomial in the input length.
One way to bypass the hardness of the problem is to design approximation algorithms, i.e. algorithms that are guaranteed to work "fast" for every input but do not necessarily find all the existing patterns for every given input. Another approach is to further restrict the patterns that the algorithm can discover so that they can be found efficiently. This is usually left to the discretion of the user, who is given the ability to set a number of parameters that determine the structure of the patterns to be looked for. A typical example is to allow the user to define the maximum length that a pattern can have. Providing the user with that kind of control over the search space is not unreasonable, since an expert user can apply domain knowledge to help a program avoid looking for patterns that are meaningless or impossible for the domain at hand. In fact, this expert knowledge is usually an integral part (in the form of various heuristics) of most of the pattern discovery algorithms that exist in the literature; the disadvantage of this approach is that most of these algorithms are usually inapplicable outside the particular domain for which they were designed. Finally, there are the algorithms that just accept the hardness of the problem and proceed head-on to find all possible patterns. Such algorithms are bound to be inefficient (space and/or time-wise) on some "bad" inputs but, depending on the domain at hand, their overall performance (amortized over all "reasonable" inputs) might be quite satisfactory. In such cases, it becomes very important to introduce a number of heuristics that will speed up the algorithm. The method presented here belongs to this category.
The standard way to assess the "quality" of an algorithm (at least in the context of computer science) is by its time/space complexity. For the pattern discovery problem, though, such a characterization is rather useless. The reason is the NP-hardness of the problem: any worst-case analysis is doomed to give bounds which are super-polynomial (unless the complexity classes P and NP are equal, something extremely unlikely). There are, however, other features that are of interest when evaluating a pattern discovery algorithm. Some of these features are:
The subclass C of regular expressions containing the patterns under consideration. In general, it is desirable to have as expressive a class C as possible. The price for the increased expressiveness is usually paid in terms of time/space efficiency.
The ability of the algorithm to generate all qualified patterns. As mentioned earlier, some approximation algorithms can achieve increased performance by sacrificing the completeness of the reported results. Depending on the domain of the application and the quality of the patterns discovered, this might or might not be an acceptable tradeoff.
The maximality of the discovered patterns. Consider for example the instance of the input set S given at the beginning of this section. In this case "L...E" is a perfectly legal pattern. It is not however maximal, in the sense that the pattern "L..GE", while more specific, still has the same support as "L...E". Reporting patterns which are not maximal not only unnecessarily clutters the output (making it difficult to separate the patterns which are really important) but can also severely affect the performance of the algorithm.
It is then extremely important for a pattern discovery algorithm to be able to detect and discard non-maximal patterns as early as possible.
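The notion of a non-maximal pattern can be illustrated with a minimal sketch. The Python code below is an assumption-laden simplification: it compares only rigid patterns of equal length, and the names refines and made_redundant_by are hypothetical. A pattern p is treated as non-maximal if some pattern q fixes every letter that p fixes, fixes at least one more, and has the same support.

```python
import re


def support(pattern, strings):
    """Number of strings containing a match of a rigid pattern ('.' = don't care)."""
    rx = re.compile("".join("." if c == "." else re.escape(c) for c in pattern))
    return sum(1 for s in strings if rx.search(s))


def refines(q, p):
    """True if q is strictly more specific than p (same length, more fixed letters)."""
    return (len(q) == len(p)
            and all(b == "." or a == b for a, b in zip(q, p))
            and q != p)


def made_redundant_by(p, q, strings):
    """True if q is more specific than p yet has exactly the same support."""
    return refines(q, p) and support(q, strings) == support(p, strings)


S = ["LARGE", "LINGER", "AGE"]
print(made_redundant_by("L...E", "L..GE", S))  # True
```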
The pattern discovery algorithms can, in general, be categorized in either of the following two classes: string alignment algorithms and pattern enumeration algorithms. Below we present a short survey of both categories. A more detailed description of the two classes of pattern discovery algorithms can be found in Brazma et al., "Approaches to the Automatic Discovery of Patterns in Biosequences", Technical Report, Department of Informatics, University of Bergen, 1995. The list of algorithms discussed is certainly not exhaustive but it highlights the main trends in the respective classes.
Algorithms in the string alignment class use multiple alignment of the input strings as the basic tool for discovering the patterns in question. Given a set S={s1, . . . , sn} of strings over an alphabet Σ and a number of edit operations (e.g. mutation of a character into another character, deletion of a character, insertion of a character etc.), a multiple alignment of the strings in S is defined as:
a string w ∈ Σ*, called a consensus sequence
for each 1 ≤ i ≤ n, a sequence of edit operations for transforming si into w.
As soon as a multiple string alignment is acquired, the task of locating the patterns reduces to searching the consensus sequence for substrings with high enough support.
The problem of course is not that simple: the consensus sequence must be appropriately chosen so as to reveal the patterns shared by the input strings. This entails the assignment of costs to the various edit operations and the selection of an optimal consensus sequence that minimizes the cost of transforming the input strings into that particular sequence. A complete description of the problem is outside the scope of this document. What is important to note, though, is the fact that the problem of finding such an optimal sequence is NP-hard. As a result, most algorithms in this class resort to heuristics that produce sub-optimal alignments, thus gaining execution speed at the price of results that are generally not complete. There are also other problems related to multiple string alignment (e.g. the problem of domain swapping) which further complicate the generation of a complete list of patterns for a given input set. In general, using multiple string alignment for the discovery of patterns can be effective only when the aligned strings share global similarities, as discussed in Smith et al., "Finding Sequence Motifs in Groups of Functionally Related Proteins", Proceedings of the National Academy of Sciences, pp. 826-830, 1990; Hirosawa et al., "Comprehensive Study on Iterative Algorithms of Multiple Sequence Alignment", CABIOS, 1995; and Suyama et al., "Searching for Common Sequence Patterns Among Distantly Related Proteins", Protein Engineering, pp. 1075-1080, 1995.
Almost all the algorithms in this class have been developed for finding patterns shared by a number of (allegedly related) biological sequences. The reason is that the edit operations of mutation, insertion and deletion are the mechanisms used by evolution to differentiate among species. This makes the utilization of multiple string alignment a natural tool for attacking the pattern discovery problem in the context of Biology.
Most of the string alignment algorithms use pairwise alignments and then apply a number of heuristics in order to approximate an optimal multiple alignment of the input strings. In Martinez, M., "A Flexible Multiple Sequence Alignment Program", Nucleic Acids Research, 1988, pp. 1683-1691, all possible pairs of input strings are aligned and scored. Then an ordering of the input strings is generated, based on the alignment scores (the intention is for strings that are similar to be placed close together in that ordering). The final multiple alignment is built in a piece-wise manner, by traversing the ordered list of the input strings: each time a new string is added the old alignment is modified (by adding insertions where appropriate) so that the character to character matches of the original, pairwise alignments are preserved.
A slightly different approach is pursued in Smith and Smith, "Automatic Generation of Primary Sequence Patterns from Sets of Related Protein Sequences", Nucleic Acids Research, 1990, pp. 118-122. Again the starting point is generating and scoring all possible pairwise alignments. Scoring is performed based on a partition of the amino acid alphabet into amino acid class covering (AACC) groups, based on the physicochemical properties of the amino acids. Using the scores obtained at the first step a binary tree, called a dendrogram, is built having the input sequences as its leaves; the intention is to cluster together those of the input sequences which are similar. Then the internal nodes of the dendrogram are traversed bottom-up and each node is assigned a label. This label is a pattern, obtained by aligning the two children of the node (a child is either an original sequence, if it is a leaf, or a pattern obtained through a previous alignment); in the course of the alignment procedure, aligned characters that differ are represented by the smallest AACC group that contains both of them. At the end, each internal node has been labelled by a pattern which is contained in all the input sequences that are leaves of the subtree rooted at that internal node.
Another algorithm (Emotif) employing multiple string alignment as the main tool for the discovery of patterns is presented in Neville-Manning et al., "Enumerating and Ranking Discrete Motifs", Intelligent Systems for Molecular Biology, 1997, which extends the work of Wu and Brutlag in "Identification of Protein Motifs Using Conserved Amino Acid Properties and Partitioning Techniques", Proceedings of the 3rd International Conference on Intelligent Systems for Molecular Biology, 1995, pp. 402-410. Here, along with the set S of the input strings the user also provides a collection R ⊂ 2^Σ of subsets of the alphabet Σ. A pattern can have at a given position any element E ∈ R and a character c of an input string matches that pattern position if c ∈ E. The set R provides a generalization of the AACC groups used in Smith and Smith, "Automatic Generation of Primary Sequence Patterns from Sets of Related Protein Sequences", Nucleic Acids Research, 1990, pp. 118-122. Emotif starts by generating a multiple alignment of the input strings. The alignment is used to guide the generation of the patterns in a recursive way: at each point a subset S′ of the original set S is being considered (originally S′=S) and a particular column in the alignment of the strings in S′ (originally the first column). Also, there is a pattern P currently under expansion (originally P is the empty string). The pattern P is expanded to P′=PE where E ∈ R contains at least one of the characters found in the strings of S′ at the alignment column under consideration. The expansion proceeds as long as the new pattern has sufficiently large support and is not redundant, i.e. does not appear in the same sequences (and at the same positions within the sequences) where a previously generated pattern has appeared.
At the next expansion step the set S′ will be set to those strings that match the new pattern P′ and the next column of the alignment will be considered.
A different heuristic is proposed by Roytberg in "A Search for Common Patterns in Many Sequences", CABIOS, pp. 57-64, 1992. Although his method does not directly use alignment, it works in a way reminiscent of the other algorithms in this class (in that it gets information about potential patterns by pairwise comparisons of the input strings). Here, one of the input sequences is selected as the basic sequence and is compared with all other sequences for similar segments (the notion of similarity is a parameter of the algorithm). A segment of the basic sequence gives rise to a pattern if at least k sequences (the minimum required support) have segments similar to it. The major drawback of this method is that it is crucially dependent on the selection of the basic sequence. This drawback can be partially offset by employing repeated runs of the algorithm, each one with a different basic sequence.
A comprehensive comparative study of a number of multiple string alignment algorithms can be found in Hirosawa et al., "Comprehensive Study on Iterative Algorithms of Multiple Sequence Alignment", CABIOS, 1995.
Algorithms in the pattern enumeration class enumerate all (or some) possible patterns and then verify which of these patterns have the required support. Since such algorithms explore the space of patterns they tend to be exponential in the maximal pattern size. In order to be efficient, they usually impose some kind of restriction on the structure of the patterns to be discovered.
Very roughly, the underlying idea used by all these algorithms is the following: start with the empty pattern and proceed recursively to generate longer and longer patterns. At each step enumerate all (or some) allowable patterns (i.e. those patterns that belong to the subclass C of regular expressions treated by the algorithm) that have the current pattern as prefix. For every new pattern check its support. If it is big enough continue the expansion. If not just report the current pattern and backtrack to the previous step of the expansion procedure.
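The general enumeration scheme can be sketched as follows. This Python fragment is a minimal illustration and not any of the published algorithms: the class C is assumed to consist of rigid patterns that begin and end with a letter, the parameters max_len and max_gap and all function names are hypothetical, and every pattern with sufficient support is reported. Pruning is sound here because an extension can never have larger support than the pattern it extends.

```python
import re


def support(pattern, strings):
    """Number of strings containing a match of a rigid pattern ('.' = don't care)."""
    rx = re.compile("".join("." if c == "." else re.escape(c) for c in pattern))
    return sum(1 for s in strings if rx.search(s))


def enumerate_patterns(strings, k, max_len, max_gap=2):
    """Depth-first enumeration of rigid patterns with support >= k."""
    alphabet = sorted(set("".join(strings)))
    found = []

    def expand(prefix):
        # The empty pattern is extended by a letter only; afterwards an
        # extension is a run of 0..max_gap don't cares followed by a letter.
        gaps = range(max_gap + 1) if prefix else [0]
        for gap in gaps:
            for ch in alphabet:
                candidate = prefix + "." * gap + ch
                if len(candidate) > max_len:
                    continue
                if support(candidate, strings) >= k:
                    found.append(candidate)
                    expand(candidate)  # continue the expansion
                # otherwise: prune this branch and backtrack

    expand("")
    return found


S = ["LARGE", "LINGER", "AGE"]
patterns = enumerate_patterns(S, k=2, max_len=5)
print("L..GE" in patterns, "LA" in patterns)  # True False
```

Even on this toy input the search visits many candidates, which illustrates why practical algorithms restrict the class C and discard redundant patterns early.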
The various algorithms differ in the class C of patterns that they recognize, the efficiency with which they implement pattern expansion, and their ability to detect and discard redundant patterns.
One of the first algorithms in this class, which is presented in Sobel and Martinez, "A Multiple Sequence Alignment Program", Nucleic Acids Research, 1986, pp. 363-374, was actually a part of a method used for the multiple string alignment problem (some alignment programs approximate optimal alignments by first locating patterns common to all strings and then using these patterns as the anchor points for the alignment). The algorithm works by first locating substrings X common to all the input sequences, each substring having length at least L. For every such X the set of all possible regions is defined, each region being an n-tuple (p1, . . . , pn) (n is the number of input sequences) where pi is an offset within the i-th sequence where X appears. These regions become vertices in a directed graph. If Rx=(x1, . . . , xn) and Ry=(y1, . . . , yn) are regions corresponding to the substrings X and Y, then an edge from Rx to Ry is added to the graph if the two regions are non-overlapping and Rx comes before Ry; in other words, if for every i
xi + |X| − 1 < yi.
Finally the graph is traversed for a longest possible path (they give a particular function for assigning costs to the edges but any meaningful cost function could be used in its place).
There are a number of weaknesses with this method. First, it can only find patterns supported by all the input sequences. Furthermore, in order for it to be efficient the parameter L for the minimum length of substrings to look for must be quite large. Otherwise a huge number of regions will be generated. As a result, patterns containing shorter substrings will go unnoticed.
A straightforward example of pattern enumeration appears in the work of Smith et al., "Finding Sequence Motifs in Groups of Functionally Related Proteins", Proceedings of the National Academy of Sciences, 1990, pp. 826-830. Again, the domain is that of Biology. They proceed by enumerating all patterns containing 3 characters and having the form
c1 x(0, d1) c2 x(0, d2) c3
where the ci are alphabet characters and x(i, j) indicates a flexible wildcard gap matched by any string with length between i and j (as a matter of terminology, we also speak of rigid gaps, which have the form x(i) and are matched by any string of length exactly i). Each pattern thus generated is searched over all the input strings. If it has sufficient support all sequences where it appears are aligned along this pattern and the pattern is further expanded according to the alignment. The main disadvantage of this method is that it actually enumerates all possible patterns, making it hard to handle patterns with more than 3 characters and with large values for the parameters d1 and d2.
Suyama et al. in "Searching for Common Sequence Patterns Among Distantly Related Proteins", Protein Engineering, 1995, pp. 1075-1080, describe an algorithm very similar to the above. They start by enumerating all triplets c1c2c3 and creating a cluster for every such triplet. Each cluster contains all the input sequences that contain the characters c1, c2, c3 (in that order) within a window of length W (a user defined parameter). Subsequently the clusters are further refined by adding a fourth character at the end of every triplet and breaking them up so that only sequences that contain all four characters (always in order) are in each cluster. Finally, a last refinement is performed with the addition of a fifth character. At that point each cluster undergoes a final breakdown by now requiring that the sequences not only contain the same five characters but that the number of positions between successive characters is also the same. One of the defects of this algorithm is that it handles only patterns with up to five characters. Of course, the same procedure can be extended to handle more than five characters but the efficiency drops very quickly as the number of characters and the size of the input set increase.
Neuwald and Green, "Detecting Patterns in Protein Sequences", Journal of Molecular Biology, 1994, pp. 698-712, also use as a starting point the algorithm of Smith et al., "Finding Sequence Motifs in Groups of Functionally Related Proteins", Proceedings of the National Academy of Sciences, 1990, pp. 826-830. Their method allows the discovery of rigid patterns of arbitrary length (obeying some structural restrictions set up by user defined parameters). Furthermore, they allow double character pattern positions of the form [c1c2] which can be matched by either of the characters c1, c2 (the number of these positions, though, must be kept to a minimum or the performance of the algorithm suffers). Their algorithm starts by enumerating all possible patterns with a given maximum length, given number of non-don't care positions and a maximum allowed number of double character positions. The enumeration is carried out efficiently in a depth-first manner using blocks, a special data structure recording, for every character, all offsets of the input strings that the character has a fixed distance from. The enumeration is further sped up by pruning the patterns that do not achieve sufficient support. Using statistical analysis, they keep just a portion of the patterns thus generated (those that are deemed "important"). At the final step, these patterns are combined into longer patterns by pairwise comparison: two patterns are expanded into a new one if they have adequate "compatible" appearances, i.e. if there are enough sequences where the two patterns appear separated by a fixed displacement. This expansion operation is made possible because of the block structure representation of the patterns which contains the list of offsets the patterns appear in.
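A pattern of the three-character form above can be tested against the input with an ordinary regular expression, in which a flexible gap x(0, d) corresponds to ".{0,d}". The following Python sketch is illustrative only and is not the cited authors' implementation; the function names are hypothetical. It enumerates every triplet over the alphabet of the input and keeps those with sufficient support, and the cubic number of candidates it generates shows why the approach scales poorly.

```python
import re
from itertools import product


def flexible_regex(chars, max_gaps):
    """Regex for c1 x(0, d1) c2 x(0, d2) ... with one gap between adjacent characters."""
    assert len(chars) == len(max_gaps) + 1
    parts = [re.escape(chars[0])]
    for d, c in zip(max_gaps, chars[1:]):
        parts.append(".{0,%d}" % d)  # flexible gap of length 0..d
        parts.append(re.escape(c))
    return re.compile("".join(parts))


def frequent_triplets(strings, k, d1, d2):
    """Map each triplet (c1, c2, c3) with support >= k to its support."""
    alphabet = sorted(set("".join(strings)))
    result = {}
    for triplet in product(alphabet, repeat=3):
        rx = flexible_regex(triplet, (d1, d2))
        sup = sum(1 for s in strings if rx.search(s))
        if sup >= k:
            result[triplet] = sup
    return result


S = ["LARGE", "LINGER", "AGE"]
triplets = frequent_triplets(S, k=2, d1=2, d2=1)
print(triplets[("L", "G", "E")], triplets[("A", "G", "E")])  # 2 2
```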
Based on the work of Neuwald and Green above, Jonassen et al., "Finding Flexible Patterns in Unaligned Protein Sequences", Protein Science, 1995, pp. 1587-1595, give an even more powerful pattern discovery algorithm. Theirs allows flexible patterns of the following form:
P = A1 x(i1, j1) A2 x(i2, j2) . . . A(p−1) x(i(p−1), j(p−1)) Ap
where ik ≤ jk.
The Ai are character sets (consisting of one or more characters) and are called components. A component can be either of the identity or ambiguous type, depending on whether it contains one or more than one character. The wild-card regions x(i, j) are either rigid (if i=j; such a region is written in the simpler form x(i)) or flexible (if j > i). The input to the algorithm (apart from the actual sequences) contains a host of other parameters that restrict the kind of patterns to look for (e.g. maximum pattern length, maximum number of components, maximum number of ambiguous components, the kind of ambiguous components allowed etc.) as well as the minimum support k for a pattern. The algorithm proceeds in two phases. In the first one, the block structure described in Neuwald and Green, "Detecting Patterns in Protein Sequences", Journal of Molecular Biology, 1994, pp. 698-712, is used to enumerate all patterns with length up to the maximum possible. As a preprocessing step, blocks b_{i,R} are created for every allowable component R (be it either a single character or a set of characters). Every such block contains offsets in the input sequences. A particular offset is in b_{i,R} if some character in R is at distance i from that offset. If only rigid patterns have been requested and P is the current rigid pattern at any given time, then they check all possible patterns P′ of the form P′=Px(j)R where x(j) is a rigid wild-card region of length j and R is either a single character or an allowable ambiguous component. If B_P is the block structure for P (every structure carries along with it the list of offsets where it appears), then
B_{P′} = B_P ∩ b_{L(P)+j+1, R}
where L(P) is the length of the pattern P (i.e. the size of every string matching P).
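The block intersection above can be illustrated with a short sketch. The Python code below is an assumption-laden illustration rather than the cited implementation: a block is modelled as a set of (sequence index, offset) pairs, and distance is assumed to be counted so that distance 1 denotes the offset itself, which is the convention under which the index L(P)+j+1 in the relation above addresses the first position after the gap. The function names block and extend are hypothetical.

```python
def block(strings, i, R):
    """b_{i,R}: all (sequence index, offset) pairs such that a character of R
    lies at distance i from the offset (distance 1 is the offset itself)."""
    return {(n, o)
            for n, s in enumerate(strings)
            for o in range(len(s))
            if o + i - 1 < len(s) and s[o + i - 1] in R}


def extend(B_P, len_P, j, R, strings):
    """Block of P' = P x(j) R, computed as B_P intersected with b_{L(P)+j+1, R}."""
    return B_P & block(strings, len_P + j + 1, R)


S = ["LARGE", "LINGER", "AGE"]
B = block(S, 1, {"L"})          # pattern L, length 1
B = extend(B, 1, 2, {"G"}, S)   # pattern L x(2) G, length 4
B = extend(B, 4, 0, {"E"}, S)   # pattern L x(2) G x(0) E, length 5
print(sorted(B))                # [(0, 0), (1, 0)]
print(len({n for n, _ in B}))   # 2 distinct sequences
```

Representing a block as a set makes each extension a single intersection, so the support of an extended pattern is obtained without rescanning the input.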
If, on the other hand, flexible patterns have also been allowed then a flexible pattern P is represented as the set F(P) of all the rigid patterns that make it up. For example, if
P=Dx(1, 2)Ex(2, 3)F,
then the set F(P) consists of the fixed (rigid) patterns
Dx(1)Ex(2)F,
Dx(1)Ex(3)F,
Dx(2)Ex(2)F,
Dx(2)Ex(3)F
and every pattern Q ∈ F(P) carries its own block structure.
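The construction of F(P) amounts to choosing one admissible length for every flexible region. A minimal Python sketch, with a hypothetical function name and a simple list-based representation of the pattern assumed purely for illustration, is:

```python
from itertools import product


def rigid_expansions(components, gaps):
    """Expand a flexible pattern into its rigid patterns.

    components: e.g. ["D", "E", "F"]
    gaps:       e.g. [(1, 2), (2, 3)], meaning x(1, 2) and x(2, 3)
    """
    assert len(components) == len(gaps) + 1
    choices = [range(lo, hi + 1) for lo, hi in gaps]
    result = []
    for lengths in product(*choices):  # one gap length per flexible region
        parts = [components[0]]
        for k, comp in zip(lengths, components[1:]):
            parts.append("x(%d)" % k)
            parts.append(comp)
        result.append("".join(parts))
    return result


print(rigid_expansions(["D", "E", "F"], [(1, 2), (2, 3)]))
# ['Dx(1)Ex(2)F', 'Dx(1)Ex(3)F', 'Dx(2)Ex(2)F', 'Dx(2)Ex(3)F']
```

The size of F(P) is the product of the numbers of admissible lengths of the flexible regions, so it grows quickly with the number of such regions.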
In this case the current pattern P is extended into P′=Px(i, j)R using all possible values of i ≤ j. The block structure for P′ is actually constructed by extending every pattern Q ∈ F(P) into a pattern Q_k (i ≤ k ≤ j) using the relation
B_{Q_k} = B_Q ∩ b_{L(Q)+k+1, R}
and then the set F(P′) becomes F(P′) = ⋃_{Q ∈ F(P), i ≤ k ≤ j} Q_k
In both cases (flexible regions allowed or not) further expansion of P′ is pruned if the block size of P′ is less than the minimum support k (if P′ is a flexible pattern then its block size is just the sum of the block sizes of all fixed patterns in F(P′)).
In the second phase, the patterns discovered are further processed in a number of ways. Possibilities include replacing some don't care characters by an extended alphabet of ambiguous components, extending a pattern, etc.
A similar algorithm is described by Sagot and Viari in "A Double Combinatorial Approach to Discovering Patterns in Biological Sequences", Proceedings of the 7th Symposium on Combinatorial Pattern Matching, 1996, pp. 186-208. Their algorithm also allows ambiguous components but it only treats rigid gaps. Again, the user must define (among other parameters like the maximum pattern length) which ambiguous components to use as well as, for every ambiguous component, the maximum number of times it is allowed to appear in a given pattern. They also proceed by recursively enumerating all allowable patterns, in a way very similar to that described in Jonassen et al., "Finding Flexible Patterns in Unaligned Protein Sequences", Protein Science, 1995, pp. 1587-1595. The entire process is made faster by the introduction of two heuristics.
First, an effort is made to avoid creating redundant patterns. Let P be the pattern currently expanded. If S, S′ are both allowable ambiguous components with S ⊂ S′, PS and PS′ are both possible extensions of P, and the offset lists of PS and PS′ are the same, then only the pattern PS is maintained; the expansion is pruned at PS′. This heuristic does not detect all non-maximal patterns: because of the way their algorithm builds the patterns, some redundancies will go unnoticed.
Second, a clever trick is used in order to reduce the input size. Instead of considering the ambiguous components specified by the user, they initially replace all of them by the don't care character. This greatly simplifies the pattern space to search and makes the entire discovery process much faster. After this step only the part of the input that matches the (reduced) patterns found is maintained and it is on this subset of the input that the algorithm is rerun, now using the full pattern specification given by the user.
A different approach in enumerating and verifying potential patterns is proposed by Wang et al. in "Discovering Active Motifs in Sets of Related Protein Sequences and Using Them for Classification", Nucleic Acids Research, 1994, pp. 2769-2775. The pattern language they treat is composed of m-component patterns of the form
P=X1*X2* . . . *Xm
where the number m is a user specified parameter.
The components Xi are alphabet strings and the '*' stands for flexible gaps of arbitrary length. The length |P| of the pattern is defined as the sum, over all i, of the lengths |Xi|.
The algorithm works in two phases. First, the components Xi of a potential pattern are computed. This is done by building a generalized suffix tree (GST) for the suffixes of all input sequences. Each leaf corresponds to a suffix and is labeled with the sequence containing this suffix (notice that more than one leaf can be attached to a given suffix since the suffix might appear in many sequences). Every internal node u contains the number count(u) of distinct sequences labelling the leaves of the subtree rooted at that node. So, if locus(u) is the string labelling the path from the root of the GST to an internal node u, then count(u) is the number of distinct sequences containing the substring locus(u). The first phase ends by creating the set Q containing all strings locus(u) such that count(u) ≥ k, where k is the minimum support.
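The content of the set Q can be illustrated without building a suffix tree. The following Python sketch deliberately swaps in a naive substitute for the GST: it enumerates every distinct substring of every sequence and counts the number of sequences containing it, which yields the count of every substring (not only of the strings locus(u)) at a much higher cost than a suffix tree. The function name is hypothetical and the code is meant only to show what the first phase computes.

```python
from collections import Counter


def frequent_substrings(sequences, k):
    """Return (Q, count): Q holds every substring found in at least k sequences."""
    count = Counter()
    for s in sequences:
        # Collect each distinct substring once per sequence, so that count[w]
        # is the number of distinct sequences containing w.
        distinct = {s[i:j]
                    for i in range(len(s))
                    for j in range(i + 1, len(s) + 1)}
        count.update(distinct)
    Q = {w for w, c in count.items() if c >= k}
    return Q, count


S = ["LARGE", "LINGER", "AGE"]
Q, count = frequent_substrings(S, 2)
print(sorted(Q))    # ['A', 'E', 'G', 'GE', 'L', 'R']
print(count["GE"])  # 3
```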
The second phase verifies which m-component patterns (among those that can be built from the elements of Q) have the minimum support. This is a computationally demanding process, since every potential pattern has to be compared against all the input strings. To speed things up, the space of all potential m-component patterns is pruned using the following statistical heuristic: first, a (small) random subset S′ of the original set S of input sequences is chosen. For every potential pattern P, its support in S′ is computed by matching it to the strings in S′. Then, using sampling theory, the probability that the actual support of P in S is k or more is computed, given the support of P in S′. If this probability is very small, then the pattern P is thrown away. So, at the end of the heuristic, patterns which are unlikely to have the required support in S have been discarded. The remaining patterns are then verified by computing their support over the original input set S. Those that appear in at least k of the input sequences are reported in the final results.
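The text does not give the formula used by Wang et al., so the following is only one possible way to realize such a sampling filter, under stated assumptions: the sample S′ is drawn without replacement, and a pattern is discarded when its observed sample support would be improbably low even if its true support were exactly k (a hypergeometric tail test). The function names and the threshold `alpha` are hypothetical.

```python
from math import comb


def prob_sample_support_at_most(h, n, N, K):
    """P(X <= h) for X ~ Hypergeometric: n sequences sampled from N, K of which match."""
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms vanish without special handling.
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(min(h, n) + 1))
    return hits / total


def survives_sampling(h, n, N, k, alpha=0.01):
    """Keep a pattern unless a sample support of h is implausible for true support k."""
    return prob_sample_support_at_most(h, n, N, k) >= alpha


# Hypothetical numbers: 100 input sequences, a sample of 10, minimum support 50.
keep_unseen = survives_sampling(h=0, n=10, N=100, k=50)   # never seen in the sample
keep_half = survives_sampling(h=5, n=10, N=100, k=50)     # seen in half of the sample
```

Testing against a true support of exactly k is the most lenient choice, because any larger true support makes a low sample count even less likely. As the text notes, a filter of this kind can still discard a pattern that does have the required support.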
The use of the above-mentioned statistical heuristic makes it possible to treat sets S with many sequences (although it entails the possibility of not detecting some important patterns that are not "lucky" enough to pass the statistical test). Furthermore, the approach is really effective only if the minimum support k is comparable to the number of sequences in S. Another drawback of the algorithm is that its performance deteriorates very fast as the parameter m, defining the structure of the patterns searched for, increases. Finally, there is no provision for the detection of redundant patterns.
Another example of a pattern discovery algorithm is given by Agrawal and Srikant in "Mining Sequential Patterns", International Conference on Data Engineering, 1995. The domain here is that of data mining. The formulation of the problem in this setting is slightly different. More specifically, the input consists of strings defined over (2^Σ − ∅), where Σ is the underlying alphabet. In other words, a string is a sequence of subsets of Σ rather than a sequence of elements of Σ. Given now two such strings A = (a1 a2 . . . an) and B = (b1 b2 . . . bm), m ≥ n, we say that A is contained in B if there are indices 1 ≤ i1 < i2 < . . . < in ≤ m such that
a1 ⊂ b_i1 ∧ a2 ⊂ b_i2 ∧ . . . ∧ an ⊂ b_in
Given a set S of strings and a query string q, then q has support at least k in S if it is contained in at least k sequences of S. Also, q is called maximal if there is no other sequence q′ with support at least k such that q is contained in q′.
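The containment and support definitions above can be sketched as follows. The sketch assumes that each string is represented as a list of Python sets and that the inclusion in the definition permits equality (a subset-or-equal test); the names `is_contained` and `support` are hypothetical.

```python
def is_contained(a, b):
    """True if a = (a1 ... an) is contained in b = (b1 ... bm).

    Greedy leftmost matching: each element of a is matched to the earliest
    unused position of b whose set includes it. This suffices because
    taking the earliest match never rules out a later one.
    """
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # indices must be strictly increasing
    return True


def support(q, sequences):
    """Number of sequences in S that contain the query string q."""
    return sum(1 for s in sequences if is_contained(q, s))


# Hypothetical data: q is contained in the first sequence only.
S = [[{7}, {3, 8}, {9}, {4, 5, 6}], [{3, 5}]]
q = [{3}, {4, 5}]
q_support = support(q, S)
```

Note that [{3}, {5}] is not contained in [{3, 5}]: the two elements would have to be matched to two distinct positions.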
The problem solved by the pattern discovery algorithm of Agrawal and Srikant is to find, given a set S of strings and a number k, all maximal strings with support at least k in S. The algorithm proposed works in a number of phases. In the first phase, all subsets of Σ that have support at least k are found. Call this set E. Then, in order to simplify the subsequent operations, a distinct integer label is assigned to every element of E. In the second phase, called transformation, each s = (I1, . . . , Is) ∈ S is examined in turn and every Ij ⊂ Σ is replaced by those elements of E (actually, their integer labels) which are subsets of Ij. The main work of the algorithm is done in the third phase, the sequencing phase.
In the sequencing phase, the strings that potentially have the minimum support are enumerated and verified. A number of different methods are proposed in order to carry out these tasks. The main idea in all of them is the following recursive procedure: let Li be the set of all strings of length i that have support at least k in S (initially L1 = E, where E is the set computed in the first phase of the algorithm). The set Li is used in order to generate the set Ci+1 of candidates, containing the length (i+1) strings that may have minimum support. The intuition behind the generation of Ci+1 is that if w = w1 . . . wi+1 has support at least k, then the same must be true for every length i subsequence of w. The set Ci+1 is generated by joining the set Li with itself in the following way:
For every (ordered) pair (x, y) of elements of Li, where x = (x1 . . . xi) and y = (y1 . . . yi), which are such that xj = yj for all 1 ≤ j ≤ (i−1), generate the candidate string s′ = (x1 x2 . . . x_{i−1} xi yi) and add it to Ci+1.
Go through Ci+1, removing those sequences s ∈ Ci+1 which contain an i-subsequence not in Li.
After Ci+1 has been generated, it is verified against the set S of the input strings: for every candidate string x ∈ Ci+1, the strings s ∈ S containing x are located. If their number is at least k, then x is added to Li+1; otherwise x is discarded. The whole process continues until some i is found for which Li+1 turns out to be empty.
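The join, prune and verify steps of the sequencing phase can be sketched as below. This is an illustration under assumptions, not the authors' implementation: strings are tuples of integer labels, each transformed input sequence is a list of sets of labels, and the names `generate_candidates`, `contains` and `frequent_strings` are hypothetical.

```python
from itertools import combinations


def generate_candidates(L_i):
    """Build C(i+1): join L_i with itself, then prune by i-subsequences."""
    L_i = set(L_i)
    # Join: ordered pairs (x, y) agreeing on their first i-1 elements.
    joined = {x + (y[-1],) for x in L_i for y in L_i if x[:-1] == y[:-1]}
    i = len(next(iter(L_i))) if L_i else 0
    # Prune: every length-i subsequence of a candidate must itself be in L_i.
    return {c for c in joined if all(sub in L_i for sub in combinations(c, i))}


def contains(seq, cand):
    """True if the labels of cand match, in order, distinct positions of seq."""
    pos = 0
    for label in cand:
        while pos < len(seq) and label not in seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def frequent_strings(S, L1, k):
    """Return [L1, L2, ...], stopping at the first empty level."""
    levels = []
    L = set(L1)
    while L:
        levels.append(L)
        C = generate_candidates(L)
        # Verify: keep candidates contained in at least k input sequences.
        L = {c for c in C if sum(contains(s, c) for s in S) >= k}
    return levels


# Hypothetical transformed input with minimum support k = 2.
S = [[{1}, {2}, {3}], [{1}, {2}], [{1}, {3}, {2}]]
levels = frequent_strings(S, {(1,), (2,), (3,)}, k=2)
```

In this example L2 is {(1, 2), (1, 3)}; every length-3 candidate joined from it is pruned because it has a length-2 subsequence outside L2, so the loop stops without scanning the input again.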
The final phase of the algorithm makes sure that only maximal strings are reported. This is achieved by going through every x ∈ Li (in decreasing order of i), and deleting all y ∈ Lj (j ≤ i) such that y is contained in x.
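A minimal sketch of this maximality filter follows. It assumes, as a simplification, that strings are tuples of labels compared by equality, so that containment reduces to an ordinary subsequence test; the names `is_subsequence` and `keep_maximal` are hypothetical.

```python
def is_subsequence(y, x):
    """True if y can be obtained from x by deleting zero or more elements."""
    it = iter(x)
    # "a in it" advances the iterator, so matches are found left to right.
    return all(a in it for a in y)


def keep_maximal(levels):
    """levels[i] holds the frequent strings of length i+1; return the maximal ones."""
    maximal = []
    for level in reversed(levels):      # decreasing order of length
        for y in sorted(level):
            if not any(is_subsequence(y, x) for x in maximal):
                maximal.append(y)
    return maximal


# Hypothetical levels: every length-1 string is contained in a length-2 string.
result = keep_maximal([{(1,), (2,), (3,)}, {(1, 2), (1, 3)}])
```

Comparing each string only against the survivors is enough, because containment is transitive: a string contained in a discarded string is also contained in whichever survivor caused the discard.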
Other algorithms for various versions of the pattern discovery problem have also been proposed. In Guan and Uberbacher, "A Fast Look-Up Algorithm for Detecting Repetitive DNA Sequences", Pacific Symposium on Biocomputing, 1996, pp. 718-719, an algorithm for the identification of tandem repeats in DNA sequences is described. A tandem repeat is a restricted form of a pattern containing consecutive repetitions of the same substring. For example, "AATAATAATAATAATAAT" is a tandem repeat of the substring "AAT". The substring being repeated is called the seed of the tandem repeat. Guan and Uberbacher give a linear time algorithm for the identification of short tandem repeats (where the seed of the repeat is a few characters long) using a hashing scheme introduced in Califano, A. and Rigoutsos, I., "FLASH: A Fast Look-Up Algorithm for String Homology", CABIOS. Their method computes, for every offset in the input at hand, a number of hash values (using a few characters within a small window around the offset under consideration). The seeds of a tandem repeat are identified by locating offsets that have several such hash values in common. Benson and Waterman, "A Method for Fast Database Search for all k-nucleotide Repeats", Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, 1994, pp. 83-98, provide another approach to the same problem. They begin by identifying suspicious patterns (i.e. substrings that can, potentially, be seeds for a tandem repeat) and check the area around the suspicious pattern to see if that pattern appears (either unchanged or mutated) in several consecutive copies. Their method incorporates elements from alignment algorithms in checking for the copies of the suspicious pattern.
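To make the notions of tandem repeat and seed concrete, the following sketch scans a string for runs of exact, back-to-back copies of a seed of a fixed length. It is a naive scan, not the hashing scheme of Guan and Uberbacher or the alignment-based method of Benson and Waterman, and it ignores mutated copies. The function name and parameters are hypothetical.

```python
def tandem_repeats(text, seed_len, min_copies=2):
    """Return (start, seed, copies) for each run of exact consecutive seed copies."""
    found = []
    i = 0
    while i + seed_len <= len(text):
        seed = text[i:i + seed_len]
        copies = 1
        # Extend the run while the next window equals the seed.
        while text[i + copies * seed_len:i + (copies + 1) * seed_len] == seed:
            copies += 1
        if copies >= min_copies:
            found.append((i, seed, copies))
            i += copies * seed_len   # skip past the whole run
        else:
            i += 1
    return found


# The example string from the text, flanked by hypothetical extra characters.
repeats = tandem_repeats("GG" + "AATAATAATAATAATAAT" + "CC", seed_len=3)
```

Here the single run found starts at offset 2, with seed "AAT" repeated six times.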
Each of the two classes of algorithms described above has its pros and cons. By allowing the operations of insertion and deletion, string alignment methods can locate flexible patterns, a category of increased expressiveness (and, thus, complexity). Also, by using fast approximations to optimal multiple string alignments, patterns can be quickly discovered. On the other hand, no matter how one chooses to assign costs to the allowable edit operations, there will always be inputs containing patterns that cannot be encapsulated by the optimal consensus sequence (and this remains true even if near-optimal consensus sequences are considered). This problem becomes more acute for inputs where the sequences do not have global similarities. As a result, string alignment methods can be a viable approach only if
the completeness of the reported results is not an absolute necessity, and
the input sequences can be clustered into groups that have global similarities, so that the alignment can produce "relevant" consensus sequences.
The pattern enumeration algorithms, on the other hand, have the potential of generating all non-redundant patterns. The price to be paid for completeness, though, can be steep since a huge space of potential patterns has to be searched. This search problem is magnified many-fold if one is to allow flexible patterns and/or patterns that allow multiple character positions. Furthermore, making sure that only maximal patterns are reported is a task which is far from trivial (if efficiency is to be achieved).
The problems stated above and the related problems of the prior art are solved with the principles of the present invention, a method and apparatus for pattern discovery in biological event streams. In a sampling phase, patterns are generated and stored in memory. These patterns may be generated using templates corresponding to a sequence of characters. Preferably, such templates are proper templates. Patterns are then generated corresponding to the templates and stored in memory. Alternatively, the patterns of the sampling phase can be generated by starting with a seed pattern and extending it by a single position at a time, checking at every step of the process whether the new pattern has the required support and pruning the search if it does not. In a convolution phase, the patterns stored in memory are combined to identify a set of maximal patterns.
The pattern discovery method of the present invention is advantageous because all patterns with a minimum support are reported without enumerating the entire pattern space. This makes our approach efficient. Moreover, the patterns are preferably generated in order of maximality, i.e. the maximal patterns are generated first. As a result, a redundant pattern can be easily detected by comparing it to the patterns already generated. Thus, no costly post-processing or complicated bookkeeping is required.