1. Field of the Invention
The present invention generally relates to methods and systems for reconstructing genomic ancestors. In particular, the present invention relates to methods and systems that reconstruct genomic common ancestors using a PQ tree.
2. Description of the Related Art
Various international efforts are underway to catalog the genomic similarities and variations in the human population. As the study progresses, data in the form of genomic markers is becoming available, with due respect to individual and group privacy, for public study and use. Combined with recent discoveries of inversion and transposition within the human species, this opens up the potential for using large-scale rearrangements to reconstruct the genealogy tree of the human population.
The specification provides a brief summary of discovered inversions and transpositions within the human population and the computational methods being used by the bio-informatics community to tackle the problem of reconstructing phylogeny trees.
Inversions along a chromosome are frequently observed by comparing closely related species: for example, a comparison between a chimpanzee chromosome and a human chromosome, or a mouse chromosome and a human chromosome. These are generally very long inversions that are observed as reversed gene orders.
Moreover, with the most recent builds of the chimpanzee genome, a total of 1,576 putative regions of inverted orientation of all sizes, covering more than 154 megabases, have been observed between the human and chimpanzee genomes. Inversions have also been seen across humans: an inversion on the X chromosome and a 3 Mb inversion on the short arm of the Y chromosome. Human inversions occur at a low but detectable frequency. Those large enough to be detected by conventional cytogenetic analysis occur at a frequency of 1-5 per 10,000 individuals. The inversions across humans are of particular interest, since recombination in the inverted segments in heterozygotes often leads to heritable disorders.
Inversions are also of interest because they have the potential to explain the geographic distribution of the human population, that is, to support a reconstruction of the prehistoric human colonization of the planet. The X-chromosome inversion is seen in populations of European descent at a frequency of about 18%.
Further, large chromosomal segment inversions have been seen in humans. A paracentric inversion polymorphism spanning more than 2.5 Mb in chromosome band 8p23.1-8p22 and a 900-Kb inversion on chromosome 17q21-31 have been reported. The latter inversion is seen at a frequency of 20% in Europeans, is rare in Africans, and is almost absent in East Asians.
Large chromosomal rearrangement polymorphisms, such as deletions or duplications, are apparent from a loss or gain of heterozygosity. Inversions, however, are difficult to detect and may go unnoticed if the inverted segment is small.
The inversions may occur in coding, non-coding, or intra-gene regions of the chromosome. Hence, a model that tracks the gene orders of the chromosome is inadequate for modeling segment inversions. Instead, these inversions are being discovered and reported in terms of the order of the labeled short tandem repeat polymorphisms.
Further, unlike genes, these markers are not signed. Also, the ancestral segment is unknown. In other words, it is unclear which order of the segment came first.
Translocations have also been observed in humans, although these have mostly involved single genes and are generally associated with a disorder. It is believed that as more individual differences are learned, more such variations, whether transpositions or inversions, will surface. In fact, the inversions reported so far may be only the tip of the iceberg.
FIG. 1 illustrates a short tandem repeat polymorphism on two human chromosomal segments. The blocked segment shown here is inverted in a significant fraction of the human population.
Loosely speaking, there are two conventional computational approaches to studying the evolutionary relationships of genomes: one studies the individual gene sequences, and the other studies the arrangement of multiple genes on the genome. A very large amount of literature exists for the first approach (including sequences under the character model), which is not described here to avoid digression.
The second approach dates back to the early part of the last century, when chromosomal inversions in Drosophila were first described. Active interest in the study of genome rearrangements over the last decade has resulted in some very interesting observations and debates in the community.
In the context of genome rearrangements, genomes are viewed as permutations where each integer corresponds to a unique gene or marker. For mono-chromosomal genomes, the most common rearrangement is the inversion, which is often called a reversal in the area of bio-informatics. Without loss of generality, a permutation of length $n$ can be written as $\pi_1$. With $i \le j$, the inversion on $\pi_1$ is defined as $r_{ij}(\pi_1)$, and with $i \le j < k$, the transposition on $\pi_1$ is defined as $t_{ijk}(\pi_1)$, where the underlined portion is the reversed or transposed segment:

$$\pi_1 = p_1 p_2 \ldots p_{i-2} p_{i-1}\, p_i p_{i+1} p_{i+2} \ldots p_j\, p_{j+1} p_{j+2} \ldots p_k\, p_{k+1} \ldots p_n$$

$$r_{ij}(\pi_1) = p_1 p_2 \ldots p_{i-2} p_{i-1}\, \underline{p_j p_{j-1} p_{j-2} \ldots p_i}\, p_{j+1} p_{j+2} \ldots p_k\, p_{k+1} \ldots p_n$$

$$t_{ijk}(\pi_1) = p_1 p_2 \ldots p_{i-2} p_{i-1}\, p_{j+1} p_{j+2} \ldots p_k\, \underline{p_i p_{i+1} p_{i+2} \ldots p_j}\, p_{k+1} \ldots p_n$$
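The two operations defined above can be illustrated with a minimal Python sketch. The function names `inversion` and `transposition` are hypothetical and chosen only for this illustration; the indices are 1-based and inclusive, to match the subscripts in the definitions.

```python
from typing import List, Sequence


def inversion(perm: Sequence[int], i: int, j: int) -> List[int]:
    """Return r_ij(perm): the segment p_i..p_j is reversed (1-based, inclusive)."""
    if not 1 <= i <= j <= len(perm):
        raise ValueError("require 1 <= i <= j <= n")
    p = list(perm)
    # Python slices are 0-based and half-open, so p_i..p_j is p[i-1:j].
    p[i - 1:j] = p[i - 1:j][::-1]
    return p


def transposition(perm: Sequence[int], i: int, j: int, k: int) -> List[int]:
    """Return t_ijk(perm): block p_i..p_j is moved to just after block p_{j+1}..p_k."""
    if not 1 <= i <= j < k <= len(perm):
        raise ValueError("require 1 <= i <= j < k <= n")
    p = list(perm)
    # prefix + block p_{j+1}..p_k + block p_i..p_j + suffix
    return p[:i - 1] + p[j:k] + p[i - 1:j] + p[k:]


if __name__ == "__main__":
    pi1 = [1, 2, 3, 4, 5, 6, 7, 8]
    print(inversion(pi1, 3, 5))         # markers 3..5 reversed
    print(transposition(pi1, 3, 5, 7))  # blocks 3..5 and 6..7 exchanged
```

Both functions return a new list and leave the input unchanged, so a permutation can be compared directly with its rearranged forms.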
Clearly, $r_{ij}(r_{ij}(\pi)) = \pi$, that is, applying the same inversion twice restores the original permutation, which leads to the idea of a shortest inversion path between two permutations. The length of this shortest inversion path between $\pi_1$ and $\pi_2$ is the distance between the two, written $D_r(\pi_1, \pi_2)$. However, computing $D_r(\pi_1, \pi_2)$ for a given pair of unsigned permutations $\pi_1$ and $\pi_2$ is NP-complete. It has been shown that by supplementing the genes with signs, this problem can be solved in polynomial time by using graph structures termed “hurdles” and “fortresses.”
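To make the notion of inversion distance concrete, the following Python sketch computes the shortest inversion path length between two small unsigned permutations by exhaustive breadth-first search. This is only an illustrative brute-force approach whose running time grows factorially with the permutation length; it is not the polynomial-time signed algorithm based on hurdles and fortresses, and the function name `inversion_distance` is hypothetical.

```python
from collections import deque
from itertools import combinations
from typing import Sequence


def inversion_distance(source: Sequence[int], target: Sequence[int]) -> int:
    """Minimum number of inversions turning source into target (brute-force BFS)."""
    start, goal = tuple(source), tuple(target)
    if sorted(start) != sorted(goal):
        raise ValueError("both arguments must contain the same markers")
    if start == goal:
        return 0
    n = len(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, dist = queue.popleft()
        # Try every inversion of a segment with at least two markers (0-based i < j).
        for i, j in combinations(range(n), 2):
            nxt = perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise RuntimeError("target not reachable")  # not expected for valid input


if __name__ == "__main__":
    print(inversion_distance([1, 2, 3, 4], [1, 3, 2, 4]))
    print(inversion_distance([1, 2, 3], [3, 1, 2]))
```

Because every inversion is its own inverse, the search gives the same value in either direction, which is why the result can be treated as a distance. The sketch is practical only for permutations of roughly ten markers or fewer.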
In sequences, the problems of multiple sequence alignment and the construction of the implicit phylogeny tree have been conventionally separated for simplicity. Such a distinction under the genome rearrangement model is not so obvious. However, breakpoint phylogeny was introduced to study this problem under a simplified cost function of minimizing the number of breakpoints.
Heuristic approaches have also conventionally been applied to this problem. A rich body of literature on inferring phylogenies under the sequence or character models exists, including attempts to apply sequence- and distance-based methods to genome rearrangement problems.
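As an illustration of the breakpoint cost, the following Python sketch counts breakpoints between two unsigned permutations. It assumes one common definition, which is not taken from this specification: a breakpoint is a pair of markers that is adjacent in one permutation but not adjacent, in either order, in the reference. No framing markers are added at the ends, and the function name `breakpoints` is hypothetical.

```python
from typing import Sequence


def breakpoints(perm: Sequence[int], reference: Sequence[int]) -> int:
    """Count adjacencies of perm that are not adjacencies of reference (unsigned)."""
    if sorted(perm) != sorted(reference):
        raise ValueError("both arguments must contain the same markers")
    # Unordered pairs, because unsigned markers carry no orientation.
    adjacent = {frozenset(pair) for pair in zip(reference, reference[1:])}
    return sum(
        1 for pair in zip(perm, perm[1:]) if frozenset(pair) not in adjacent
    )


if __name__ == "__main__":
    identity = [1, 2, 3, 4, 5]
    # Inverting the segment 2..4 breaks the adjacencies at both segment ends.
    print(breakpoints([1, 4, 3, 2, 5], identity))
```

Under this assumed definition, a single inversion creates at most two breakpoints, one at each end of the inverted segment, and counting them takes time linear in the permutation length.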
In this context, a key observation is that the “distance” between two members, or member and ancestor, within the species is small.