1. Field of the Invention
The present invention generally relates to a system and method for identifying gapped permutation patterns, as within a genome sequence.
2. Description of the Related Art
As research in genomic science evolves, there is a rapid growth in the number of available complete genome sequences. To date about a dozen species of animals have had their complete deoxyribonucleic acid (hereinafter “DNA”) sequence determined and the list is growing. The knowledge of gene positions on a chromosome, combined with the strong evidence of a correlation between the position and the function of a gene, makes the discovery of common gene clusters invaluable, firstly for understanding and predicting gene/protein functions and secondly for providing insight into ancient evolutionary events.
The list of species whose complete DNA sequences have been read is growing steadily, and it is believed that comparative genomics is in its early stages. Permutations on sequences representing gene clusters on genomes across species are being studied under different models to cope with this explosion of data. The challenge is to intelligently and efficiently analyze genomes in the context of other genomes.
In recent years, this problem has been modeled and studied intensively by the research community. One conventional method for modeling a genome sequence allows only orthologous genes and therefore models genomes as permutations. This method defines a “common interval” to be a pair of intervals of two given permutations consisting of the same set of genes. The number of such intervals, N_O, can be quadratic in the size of the input, N_I. This method gives an optimal O(N_I + N_O) time algorithm based on a clever observation of the monotonicity of certain integer functions.
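The definition of a common interval can be illustrated with a minimal sketch. The code below is a naive quadratic enumeration, not the optimal algorithm referred to above, and the function name and the tuple layout of the result are assumptions made only for illustration. It assumes both inputs are permutations of the same set of genes.

```python
def common_intervals(perm_a, perm_b):
    """Enumerate common intervals of two permutations of the same genes.

    Returns tuples (i, j, lo, hi): the genes at perm_a[i..j] are exactly
    the genes at perm_b[lo..hi]. Only intervals of two or more genes are
    reported. Naive O(n^2) sketch, for illustration only.
    """
    # Position of every gene in the second permutation.
    pos_b = {gene: idx for idx, gene in enumerate(perm_b)}
    found = []
    n = len(perm_a)
    for i in range(n):
        lo = hi = pos_b[perm_a[i]]
        for j in range(i + 1, n):
            p = pos_b[perm_a[j]]
            lo = min(lo, p)
            hi = max(hi, p)
            # j - i + 1 distinct genes span hi - lo + 1 positions in perm_b,
            # so they are contiguous there exactly when the two widths match.
            if hi - lo == j - i:
                found.append((i, j, lo, hi))
    return found


if __name__ == "__main__":
    print(common_intervals([1, 2, 3, 4], [2, 1, 4, 3]))
```

For the two permutations in the usage line, the sketch reports the gene sets {1, 2}, {3, 4}, and {1, 2, 3, 4} as common intervals.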
This result was then extended to m ≥ 2 permutations by another conventional method. This method introduced the idea of “irreducible” intervals, whose number can be only linear in the size of the input. This method offered an optimal O(N_I) time algorithm to detect all of the irreducible intervals and, based on this, the method provides an optimal O(N_I + N_O) algorithm for the general problem with m ≥ 2 permutations.
The problem of identification of common intervals has been revisited by another conventional method which formalized the notion of distance-based clusters: this allows the genes in a cluster to be separated by gaps that do not exceed a pre-defined threshold. Such clusters are termed gene teams and they are constrained to appear in all given m sequences. This conventional method presents an O(mn log^2 n) algorithm to detect the gene teams that occur in m sequences defined on n genes (alphabet).
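The notion of a gene team can be illustrated with a minimal sketch. The code below is a naive repeated-splitting procedure, not the O(mn log^2 n) algorithm referred to above. The function names, the representation of each genome as a mapping from gene to position, and the assumption that every gene occurs exactly once in every genome are all introduced here only for illustration.

```python
def split_by_gap(genes, pos, delta):
    """Split a gene set into maximal runs whose consecutive gaps are <= delta."""
    ordered = sorted(genes, key=pos.__getitem__)
    parts = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if pos[cur] - pos[prev] > delta:
            parts.append([cur])       # gap too large: start a new run
        else:
            parts[-1].append(cur)     # gap acceptable: extend current run
    return [set(part) for part in parts]


def gene_teams(positions, delta):
    """Return the gene teams for a list of genomes.

    positions: one dict per genome, mapping gene -> position (hypothetical
    input format). A candidate set is split whenever some genome separates
    its genes by a gap larger than delta; a set that no genome can split is
    reported as a team. Single genes are reported as trivial teams.
    """
    teams = []
    stack = [set(positions[0])]
    while stack:
        genes = stack.pop()
        for pos in positions:
            parts = split_by_gap(genes, pos, delta)
            if len(parts) > 1:
                stack.extend(parts)   # re-examine each part in all genomes
                break
        else:
            teams.append(frozenset(genes))
    return teams


if __name__ == "__main__":
    genome_1 = {"a": 1, "b": 2, "c": 4, "d": 8, "e": 9}
    genome_2 = {"a": 1, "c": 2, "b": 3, "e": 6, "d": 7}
    print(gene_teams([genome_1, genome_2], delta=2))
```

With a threshold of 2, the two hypothetical genomes in the usage lines yield the teams {a, b, c} and {d, e}: the gap between c and d in the first genome exceeds the threshold, and neither part can be split further in either genome.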
A slightly modified conventional method for modeling a genome sequence allows paralogs. A pattern discovery framework was formalized for a setting in which a pattern is a gene cluster that allows for multiplicity (in other words, paralogous genes within a cluster) and that appears at least a quorum of K>1 times. K is a quorum parameter that is usually used in the context of patterns.
Formally, the following problem has been addressed:
Given m strings (with possible multiplicity) on Σ of length n_i each, and a quorum parameter K, the task is to obtain all permutation patterns p that occur in at least K sequences.
This problem typically uses m=1; however, it can be trivially modified to deal with an m>1 scenario.
The algorithm for solving this problem is based on Parikh-maps and has a time complexity of O(L n log|Σ| log n), where L (< max_i(n_i)) is the size of the largest cluster and n = Σ_i n_i. Then, the notion of “maximal” permutation patterns or clusters was introduced. Maximality is a non-trivial notion requiring a special notation. In this notation the maximality has been represented by a PQ tree, and a linear time algorithm has been used to compute the maximality for each pattern (cluster).
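The problem statement above can be illustrated with a minimal sketch. The code below is a brute-force enumeration, not the Parikh-map algorithm with the stated time complexity, and it does not compute maximality or PQ trees. The function name, the max_size parameter, and the use of a sorted tuple as a stand-in for a Parikh vector are assumptions made only for illustration.

```python
from collections import defaultdict


def permutation_patterns(sequences, quorum, max_size):
    """Return permutation patterns that occur in at least `quorum` sequences.

    A pattern is a multiset of characters, represented here by the sorted
    tuple of a window's characters, so that any reordering of the same
    characters maps to the same key. Windows of 2..max_size characters are
    examined. Brute-force sketch, for illustration only.
    """
    # pattern key -> indexes of the sequences in which it occurs
    hits = defaultdict(set)
    for seq_idx, seq in enumerate(sequences):
        for size in range(2, max_size + 1):
            for start in range(len(seq) - size + 1):
                key = tuple(sorted(seq[start:start + size]))
                hits[key].add(seq_idx)
    return {key for key, seqs in hits.items() if len(seqs) >= quorum}


if __name__ == "__main__":
    print(permutation_patterns(["abcx", "ycab"], quorum=2, max_size=3))
```

In the usage line, the windows "abc" and "cab" share the same multiset of characters, so {a, b, c} is reported as a pattern with quorum 2, as is {a, b}.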
Using a similar model of genomes as sequences, rather than permutations, another conventional method uses a Θ(n^2) algorithm for extracting all common intervals of two sequences. The notion of gene teams was also extended to clusters of orthologous genes (hereinafter “COG”) teams by allowing any number of paralogs and orthologs, and an O(mn) time algorithm to find such COG teams was devised for pairwise chromosome comparison, where m and n are the number of orthologous genes in the two chromosomes.
As the number of genomes under study grows, it becomes important to handle not only distance-based clusters (or gapped clusters), but also cluster co-occurrence in some subset, and not necessarily all, of the m genomes. Again, maximality is an important idea that not only preserves the internal structure, but also cuts down on the number of clusters without loss of information. The conventional systems have not been adequate in handling the above challenges.
Further, conventional systems and methods are not capable of handling wild cards or gaps, and the occurrences of the sequences must be exact.