Prior art methods of discovering patterns of symbols in a family of symbol sequences are computationally intensive. The computational intensity is dependent upon the lengths of the sequences (i.e., number of symbols in each sequence) and the size of the alphabet (i.e., the number of distinct symbols found in each sequence). Running time (i.e., the number of computational steps required) for the prior art methods tends to increase in proportion to the product of the lengths of the sequences and decrease in proportion to the alphabet size.
Patterns that occur in (i.e., are common to) “q” number of sequences in a family of “k” sequences are said to have q “levels of support”. For example, patterns that are common to two sequences are said to have a level of support of two. Patterns that are common to a greater number of sequences in a family are said to have a greater level of support. Patterns with greater levels of support are usually more descriptive of so-called “features”, or properties, of the underlying system. In biology, for example, these features characterize chemical or physical properties of proteins or nucleic acids.
The method of published United States Patent Application 2003-0220771-A1, Vaidyanathan el al., assigned to the assignee of the present invention, discovers patterns in two or more sequences. The method of this application first discovers patterns of symbols in pairs of sequences, then finds patterns of symbols at increasingly higher levels of support based upon the patterns found in the pairs. The identity of the symbols in the patterns is retained throughout the practice of this method, and all calculations are done with the alphabet of those symbols. Retaining the symbol identity may detract from the efficiency of the method.
In view of the foregoing it is believed advantageous to be able to discover patterns common to two or more sequences in a family of sequences in a more computer-efficient manner.