1. Field of the Invention
The present invention generally relates to methods and systems for performing comparisons of genomic data. In particular, the present invention relates to methods and systems for providing a PQ tree that represents gene clusters that are common to at least two genomes.
2. Description of the Related Art
Genomes are sequences of genes. The inventors had multiple sets of genomes and desired to study the similarity across them. There is a theory that genomes share commonality due to a common origin; that is, genomes may be related to each other through a common ancestry.
Further, it is desirable to know the order of the ancestry for these genomes. The genomes, which were intermediates to the common origin and the current set of genomes, may not be known.
If a collection of genes in one genome appears together in another genome, there is a theory that these genomes are similar and that they may have a common ancestor.
Given two permutations of n distinct characters, a common interval may be defined to be a pair of intervals of these permutations consisting of the same set of characters, and an O(n+K) time algorithm for extracting all common intervals of two permutations may be devised, where:
$$K \le \binom{n}{2} \qquad (1)$$
is the number of common intervals.
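The definition of a common interval can be illustrated with a short sketch. The code below is not the O(n+K) algorithm mentioned above; it is a naive O(n²) enumeration, and the function name `common_intervals` and the sample gene labels are hypothetical. It assumes both inputs are permutations of the same distinct genes and reports only intervals of at least two genes, which is consistent with the bound in (1).

```python
def common_intervals(p1, p2):
    """Naively list common intervals of two permutations of the same genes.

    Returns pairs ((lo, hi), (i, j)): positions lo..hi in p1 and i..j in p2
    (0-based, inclusive) that contain the same set of genes.
    """
    pos_in_p1 = {gene: idx for idx, gene in enumerate(p1)}
    # Rewrite p2 as positions in p1; an interval of p2 is common exactly
    # when its positions in p1 form a contiguous range.
    q = [pos_in_p1[gene] for gene in p2]
    n = len(q)
    found = []
    for i in range(n):
        lo = hi = q[i]
        for j in range(i + 1, n):
            lo = min(lo, q[j])
            hi = max(hi, q[j])
            if hi - lo == j - i:  # contiguous in p1 as well
                found.append(((lo, hi), (i, j)))
    return found


genome_a = ["a", "b", "c", "d", "e"]
genome_b = ["c", "a", "b", "e", "d"]
intervals = common_intervals(genome_a, genome_b)
```

For these two sample genomes the sketch finds four common intervals, well under the bound of 10 given by (1) for n = 5.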
An O(n) type algorithm is an algorithm whose running time grows at most in direct proportion to the size of the input.
This approach may be extended to k permutations, yielding an O(nk+K) time algorithm for extracting all common intervals of k permutations. The characters here represent genes and the string of characters represents the genome.
The “consecutive” constraint may be relaxed by introducing gene teams, which allow genes in a cluster to be separated by gaps that do not exceed a fixed threshold. An O(kn log² n) time algorithm may be used for finding all gene teams.
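The gene team idea can be sketched as a simple partition refinement. This is only an illustration under stated assumptions, not the O(kn log² n) algorithm: each genome is assumed to be given as a mapping from gene to a numeric chromosomal position, all genomes are assumed to contain the same genes, and a gap is assumed to be the difference between the positions of two neighboring team members. The names `split_by_gaps`, `gene_teams`, and `delta` are hypothetical.

```python
def split_by_gaps(genes, position, delta):
    """Split a gene set into chains whose neighboring positions differ by <= delta."""
    ordered = sorted(genes, key=lambda g: position[g])
    chains = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if position[cur] - position[prev] > delta:
            chains.append([])  # gap too large: start a new chain
        chains[-1].append(cur)
    return [set(chain) for chain in chains]


def gene_teams(genomes, delta):
    """Refine the full gene set until no genome splits any remaining set."""
    pending = [set(genomes[0])]
    teams = []
    while pending:
        candidate = pending.pop()
        for position in genomes:
            parts = split_by_gaps(candidate, position, delta)
            if len(parts) > 1:
                pending.extend(parts)  # re-check each part in every genome
                break
        else:
            teams.append(candidate)  # stable in all genomes
    return teams


genome_1 = {"a": 0, "b": 1, "c": 2, "d": 10, "e": 11}
genome_2 = {"a": 5, "b": 0, "c": 6, "d": 1, "e": 2}
teams = sorted(sorted(team) for team in gene_teams([genome_1, genome_2], delta=2))
```

With these sample positions and a threshold of 2, genes a and c stay together, as do d and e, while b ends up alone because it lies far from a and c in the second genome.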
A conventional technique for deciding whether two genes are similar uses the biological concept of orthologs and paralogs. Two genes are matched if they are either orthologous (appear in different organisms, but have the same evolutionary origin and are generated during speciation) or paralogous (appear in the same organism and caused by the duplication of ancestor genes).
A slightly modified conventional model of a genome sequence that allows paralogs extends a previous conventional model by representing genomes as sequences rather than permutations, and provides a Θ(n²) algorithm for extracting all common intervals of two sequences.
The notion of gene teams may be extended to cluster of orthologous group (COG) teams by allowing any number of paralogs and orthologs, and an O(mn) time algorithm was devised to find such cluster of orthologous group teams for pairwise chromosome comparison (where m and n are the number of orthologous genes in the two chromosomes).
Pattern discovery has conventionally been formalized as a π-pattern problem. Let the pattern P=p1 . . . pm and the string S=s1 . . . sn both be sequences of characters (with possible repeats) over a given alphabet Σ (in our case genes). P appears in location i in S if and only if (p1, . . . , pm) is a permutation of (si, si+1, . . . , si+m−1). P is a π-pattern if it appears at least K times in S for a given K. The notion of a maximal π-pattern was introduced as a model to filter meaningful clusters from apparently meaningless ones. A π-pattern p1 is non-maximal with respect to a π-pattern p2 if each occurrence of p1 is covered by an occurrence of p2 and each occurrence of p2 covers an occurrence of p1. A conventional algorithm works in two stages. In stage 1, all π-patterns of size ≤ L are found in O(Ln log|Σ| log n) time, where n is the total length of all the sequences. For every π-pattern found, the algorithm stores a list of all the locations where the pattern appears, e.g., a location list. In stage 2, a straightforward comparison of every two location lists is used to extract the maximal π-patterns out of all the π-patterns found in stage 1. Assuming stage 1 outputs p π-patterns and the maximum length of a location list is l, stage 2 runs in O(p²l) time. Integrating the two stages to produce only the maximal π-patterns was introduced as an open problem.
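The occurrence test in the π-pattern definition can be sketched with a sliding window over S. This illustrates only the definition of "appears in location i" and the threshold K, not the two-stage algorithm described above. The function names `occurrences` and `is_pi_pattern` are hypothetical, and locations are reported 0-based rather than 1-based.

```python
from collections import Counter


def occurrences(pattern, s):
    """Return 0-based locations where a window of s is a permutation of pattern."""
    m = len(pattern)
    if m == 0 or m > len(s):
        return []
    target = Counter(pattern)
    window = Counter(s[:m])
    locations = []
    for i in range(len(s) - m + 1):
        if i > 0:
            # Slide the window: drop s[i-1], add s[i+m-1].
            leaving = s[i - 1]
            window[leaving] -= 1
            if window[leaving] == 0:
                del window[leaving]  # keep counters comparable
            window[s[i + m - 1]] += 1
        if window == target:
            locations.append(i)
    return locations


def is_pi_pattern(pattern, s, k):
    """True if pattern appears, up to permutation, at least k times in s."""
    return len(occurrences(pattern, s)) >= k


location_list = occurrences("abc", "cabxbcay")
```

Here the pattern "abc" appears at locations 0 ("cab") and 4 ("bca"), so it is a π-pattern for K = 2 but not for K = 3.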
Conventional approaches output all the found patterns as sets of genes. These conventional approaches provide no knowledge of the ordering of the genes in each appearance of the pattern, and they also output meaningless clusters.