Pattern discovery is a nascent competency in the rapidly developing field of computational biology. As genomic data is collected at ever-increasing rates, it is essential that powerful tools be available to discover the key information embedded in sequence data. This information could be, for example, important sequence similarities across different genes, or structural relationships between similar proteins. Discovery of such information will facilitate biochemical discovery and will accelerate the development of new products engineered to have desired end-use properties.
Computational biology is defined as “A field of biology concerned with the development of techniques for the collection and manipulation of biological data, and the use of such data to make biological discoveries or predictions. This field encompasses all computational methods and theories applicable to molecular biology and areas of computer-based techniques for solving biological problems including manipulation of models and datasets” (Online Medical Dictionary, 1998, 1999).
Sequence analysis is a central subset of the very broad area of computational biology. Of particular importance is sequence analysis directed to determining the amount and nature of sequence similarity, or homology. Pattern discovery is an important part of sequence analysis.
Numerous methods are commonly used to search for and understand various forms of sequence homology. Among these are single- and multiple-sequence alignment algorithms (e.g., CLUSTAL) and sequence matching algorithms (e.g., BLAST). Although these methods are extremely useful, they have limitations. In particular, many problems appear to be characterized by low or undetectable homology at the sequence level, despite evidence of structural or functional similarity. Unfortunately, alignment-based methods tend to work best in the high-homology limit and less well as homology decreases.
In view of the foregoing, it is believed advantageous to be able to discover both pure and corrupted patterns within a given sequence and also among a family of sequences, regardless of homology.
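The distinction between pure and corrupted patterns can be illustrated with a minimal sketch in Python. It is offered only as an illustration under stated assumptions and is not the method of this disclosure. A "pure" pattern is taken here to be a substring that recurs exactly, and a "corrupted" pattern is assumed to be a recurrence that differs from the pattern by a small number of substitutions; insertions and deletions are ignored for simplicity. The function names `pure_patterns`, `hamming`, and `corrupted_matches`, and the example sequence, are hypothetical.

```python
from collections import defaultdict


def pure_patterns(seq, k):
    """Return each length-k substring that occurs at least twice,
    mapped to the list of its start positions (exact repeats only)."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def corrupted_matches(seq, pattern, max_mismatches):
    """Return (start, mismatches) for every window of seq that matches
    pattern with at most max_mismatches substitutions.
    Assumption: corruption means substitutions only, no gaps."""
    k = len(pattern)
    hits = []
    for i in range(len(seq) - k + 1):
        d = hamming(seq[i:i + k], pattern)
        if d <= max_mismatches:
            hits.append((i, d))
    return hits


# Hypothetical sequence: two exact copies of GATTACA and one
# copy with a single substitution (GATGACA).
seq = "GATTACAGGGATTACATTGATGACA"
exact = pure_patterns(seq, 7)
approx = corrupted_matches(seq, "GATTACA", 1)
```

In this example, the exact search reports only the two identical occurrences, while the search that tolerates one mismatch also reports the third, altered occurrence. Both functions operate without first aligning the sequences.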