Genomes evolve both by acquiring new sequences and by rearranging/mutating existing sequences.
Rearrangements of genomes are driven by processes internal to the genome. One cause is unequal recombination, which results from mispairing by the cellular systems for homologous recombination. Non-reciprocal recombination results in duplication or rearrangement of loci. Duplication of sequences within a genome provides a major source of new sequences: one copy of the sequence can retain its original function while the other may evolve a new function. Furthermore, significant differences between individual genomes are found at the molecular level because of polymorphic variations caused by recombination.
Another major cause of variation is provided by transposable elements, or transposons. These are discrete sequences in the genome that are mobile, i.e. they are able to transpose themselves to other locations within the genome. The mark of a transposon is that it moves directly from one site in the genome to another. Unlike most other processes involved in genome restructuring, transposition does not rely on any relationship between the sequences at the donor and recipient sites. Transposons may provide a major source of mutations in the genome.
Transposons fall into two general classes. The first class includes transposons which exist as sequences of DNA coding for proteins that are able to manipulate DNA directly so as to propagate themselves within the genome. The second class of transposons is related to retroviruses, and the source of their mobility is the ability to make DNA copies of their RNA transcripts; the DNA copies then become integrated at new sites in the genome. These transposons are often termed retroposons, retrotransposons or retroviral-like elements (RLEs).
Mobile elements make up over 45% of the human genome. These elements continue to amplify and, as a result of negative effects of their transposition, they contribute to numerous human diseases (Deininger and Batzer, 1999; Ricci et al., 2003; Sorek et al., 2002). All eukaryotic genomes contain mobile elements, although the proportion and activity of the classes of elements are generally thought to vary widely between genomes. They use extensive cellular resources in their replication, expression and amplification. There is considerable debate as to whether they are primarily an intracellular plague that attacks the host genome and exploits cellular resources, or whether they are tolerated because of their occasional positive influences in genome evolution.
Transposable elements can promote rearrangements of the genome, directly or indirectly. The transposition event itself may cause deletions or inversions or lead to the movement of a host sequence to a new location. Further, transposons serve as substrates for cellular recombination systems by functioning as “portable regions of homology”; two copies of a transposon at different locations (even on different chromosomes) may provide sites for reciprocal recombination. Such exchanges result in deletions, insertions, inversions or translocations.
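The effect of recombination between two copies of a repeated element can be illustrated with a toy string model. The following is a minimal sketch, not a model of any real recombination system: the function names, the element `GATTC` and the example genomes are hypothetical, and the sketch assumes a single linear molecule carrying exactly two copies of the element. Recombination between two copies in the same orientation removes the intervening sequence together with one copy; recombination between copies in opposite orientation inverts the intervening sequence.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq))


def recombine_direct_repeats(genome: str, element: str) -> str:
    """Toy model: exchange between two direct repeats deletes the interval.

    The sequence between the two copies is lost along with one copy of
    the element; a single copy remains.
    """
    first = genome.find(element)
    second = genome.find(element, first + len(element))
    if first < 0 or second < 0:
        return genome  # fewer than two copies: nothing to recombine
    return genome[:first] + genome[second:]


def recombine_inverted_repeats(genome: str, element: str) -> str:
    """Toy model: exchange between inverted repeats inverts the interval.

    The second copy is expected as the reverse complement of the first.
    Both copies stay in place; the sequence between them is flipped.
    """
    first = genome.find(element)
    second = genome.find(reverse_complement(element), first + len(element))
    if first < 0 or second < 0:
        return genome
    start = first + len(element)
    return genome[:start] + reverse_complement(genome[start:second]) + genome[second:]


# Hypothetical example: flank + element + interval + element + flank
direct = "AAAA" + "GATTC" + "CCGT" + "GATTC" + "TTTT"
inverted = "AAAA" + "GATTC" + "CCGT" + reverse_complement("GATTC") + "TTTT"

print(recombine_direct_repeats(direct, "GATTC"))
print(recombine_inverted_repeats(inverted, "GATTC"))
```

Copies located on different chromosomes would need two strings and an exchange of the flanking arms, which this single-molecule sketch does not attempt.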
The inventor's earlier application, granted as U.S. Pat. No. 6,383,747, describes methods of analysing ancestral haplotypes. Ancestral haplotypes are DNA sequences from multigene complexes such as the Human Major Histocompatibility Complex (MHC), a region of chromosomal DNA which plays a key role in the immune system and influences diverse functions and diseases. The MHC contains multiple polymorphic and duplicated genes (Zhang et al., 1990). The method relies inter alia upon the presence of duplications which are imperfect. The ancestral haplotypes of the MHC extend from HLA B to HLA DR and have been conserved en bloc. These ancestral haplotypes and recombinants between any two of them account for about 73% of ancestral haplotypes in the Caucasian population. Other multigene complexes containing ancestral haplotypes include the lipoprotein gene complex and the RCA complex.
The most common approach in species identification focuses on two regions of the mitochondrial genome, the D-loop and the cytochrome b (Cyt-B) gene (Branicki et al., 2003). Because mitochondrial DNA has a relatively high mutation rate, it is commonly used to examine differences between species. Cyt-B studies have demonstrated success for a wide array of species, although problems such as variable amplification efficiencies and an inability to differentiate between closely related species have been observed in some cases. In general, the approach relies on sequencing and subsequent comparison of the sequence against an available genetic database to identify the origin of the sample. Other methods include the analysis of nuclear targets such as the beta actin, 28S rRNA and TP53 genes, as well as the use of Short Tandem Repeats and Random Amplification of Polymorphic DNA (RAPD). Regardless, there is a need for efficient methods for the identification of species using genetic analysis.
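The database-comparison step described above can be sketched as follows. This is an illustrative simplification only: it scores a query against each reference by the fraction of shared k-mers (Jaccard similarity) rather than by the sequence alignment that practical database searches use, and the function names, species labels and short sequences are invented for the example and are not real Cyt-B data.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_match(query: str, references: dict, k: int = 4):
    """Rank reference sequences by Jaccard similarity of k-mer sets.

    Returns the name of the highest-scoring reference and all scores.
    A stand-in for alignment-based database search, not a replacement.
    """
    query_kmers = kmers(query, k)
    scores = {}
    for name, ref in references.items():
        ref_kmers = kmers(ref, k)
        union = query_kmers | ref_kmers
        scores[name] = len(query_kmers & ref_kmers) / len(union) if union else 0.0
    return max(scores, key=scores.get), scores


# Hypothetical reference "database" and query (made-up sequences)
references = {
    "species_A": "ATGACCAACATTCGAAAATC",
    "species_B": "ATGGCTAGCTTACGTAAGTC",
}
query = "ATGACCAACATTCGAAAGTC"

name, scores = best_match(query, references)
print(name, scores)
```

A scheme of this kind inherits the limitation noted above: references from closely related species share most of their k-mers, so their scores can be too close to separate reliably.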
Large datasets are being created in fields as diverse as atmospherics, population ecology, forensics, particle physics, fluid mechanics, genomics and proteomics. Because of their size, the analysis of internal structure and patterns within large datasets requires substantial computer power. For example, in evolutionary genomics, where structural patterns are compared between different species, the data strings are sequences of DNA on the order of 2-3 gigabases. The number of possible permutations when comparing even small sequence sets is therefore immense, and comparing large sequence sets is beyond all but the world's most powerful computers. In order to reduce the run time of analysis, options being developed include constructing larger computers such as massively parallel arrays, or smaller processors in large clusters, which adds cost and an increased amount of hardware.
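The scale of the problem can be made concrete with a rough calculation. The sketch below assumes a naive all-against-all comparison of fixed-length windows between two sequences; the window length of 100 bases and the rate of 10^9 comparisons per second are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def window_pairs(len_a: int, len_b: int, window: int) -> int:
    """Number of window pairs in a naive all-against-all comparison.

    Each sequence of length n contains n - window + 1 overlapping windows.
    """
    return (len_a - window + 1) * (len_b - window + 1)


def years_needed(pairs: int, comparisons_per_second: float) -> float:
    """Run time in years at an assumed constant comparison rate."""
    return pairs / comparisons_per_second / SECONDS_PER_YEAR


# Two 3-gigabase sequences, 100-base windows (assumed values)
pairs = window_pairs(3_000_000_000, 3_000_000_000, 100)
print(f"{pairs:.2e} window pairs")
print(f"{years_needed(pairs, 1e9):.0f} years at 1e9 comparisons per second")
```

Under these assumptions the count is about 9 × 10^18 window pairs, which corresponds to centuries of run time on a single processor at the assumed rate.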
There is a need for further methods of genetic analysis which can be used to produce a profile providing information regarding genomic DNA in a test sample. Here we describe a particular form of genetic analysis relying on the amplification of complementary duplicons. The present invention also seeks to provide a method of comparing large datasets, such as sequences of genomic DNA, for the purpose of ascertaining duplication between portions of the compared sequences in a simpler and more cost-effective manner than previously performed.