The present invention relates to methods of comparative genomic analysis. Specifically, the invention provides a novel method to rapidly identify rearrangements within a test genome, e.g., a tumor genome, in comparison with a substantially-sequenced reference genome.
The ability to compare related genomes has long been a major goal of biological research. For example, because tumor genomes are known to have many rearrangements (e.g., amplifications, deletions, transpositions, translocations, episomes and double minutes) that can contribute to tumor progression, the ability to identify such rearrangements by comparing tumor genomes with normal genomes would allow the identification of cancer-causing genetic alterations. Indeed, numerous cancer genes have been identified on the basis of their localization to specific chromosomal rearrangements. In addition, closely related species, or even different individuals or strains within a single species, can differ by virtue of genomic rearrangements that can play important roles in causing the phenotypic differences between the species or strains. Thus, the identification of genes affected by rearrangements between species or strains would allow the identification of genetic events that accompany speciation or the establishment of strain-specific phenotypic differences. In each of these cases, genomic rearrangements can produce profound effects on a cell or individual because they result in, e.g., gene mutation, deletion, or the creation of novel chimeric genes with altered or enhanced function. The identification of genes that are affected by such rearrangements would thus enable, inter alia, the development of useful diagnostic and prognostic markers, and would suggest targets for therapeutic intervention (see, e.g., Ehrlich, M. (2000) DNA Alterations in Cancer, Genetic and Epigenetic Changes. Eaton Publishing, Natick, Mass.).
Traditionally, efforts to compare related genomes have relied either on the comparison of individual sequences within the genomes, or on the detection of rearrangements based on cytogenetic analysis, including traditional cytogenetic methods based on, e.g., G-banding, silver staining (NOR), and C-banding, as well as more recent tools such as fluorescence in situ hybridization (FISH), representational difference analysis (RDA), restriction landmark genome scanning (RLGS), high-throughput loss of heterozygosity (LOH) and comparative genome hybridization (CGH; see, Kallioniemi et al. (1992) Science 258: 818-21) (see, e.g., Gray and Collins, (2000) Carcinogenesis 21:443-452).
While each of these widely used methods has enabled significant advances, none of them provides a high-resolution method for rapidly, efficiently, and systematically identifying any type of rearrangement within a test genome in comparison to a sequenced reference genome. Further, this deficiency is becoming increasingly acute because of the large number of genome sequences that have already been determined, and because of the even larger number of genome projects that are still in progress. While the availability of genome sequences for virtually any organism will soon allow the systematic sequence-based comparison of related genomes, the only currently available method for doing this requires the complete sequencing of each genome involved in the comparison. While such comparisons will thus be technically possible, the cost of sequencing an entire genome will remain a significant impediment to such studies for the foreseeable future. Clearly, there is a great need for new and more efficient sequence-based approaches for the comparison of related genomes. The present invention addresses these and other needs.
The present invention provides a novel method for identifying rearrangements in a test genome, e.g., a tumor genome, when compared to a reference genome. This method represents a major improvement over previous methods in terms of efficiency, rapidity, and cost-effectiveness. The present method involves generating a library from a test genome, sequencing the ends of the inserts in the library, and comparing the co-linearity of the sequenced ends in the library with corresponding sequences in a reference genome. This invention is useful for any of a number of applications, including for the identification of rearrangements within tumor genomes, between closely related species, and between different strains of the same species.
In one aspect, the present invention provides a method for comparing a test genome to a reference genome, the method comprising (i) providing a plurality of clones of known size that substantially cover at least a portion of the test genome; (ii) obtaining sequence information from the termini of each of the plurality of clones; (iii) identifying a pair of sequences within the reference genome that corresponds to each pair of terminal sequences; and (iv) determining the relationship between the members of each pair of corresponding sequences within the reference genome; wherein a difference in the observed relationship between the members of any of the pairs of corresponding sequences within the reference genome and the expected relationship based upon the known size of the plurality of clones indicates the presence of a rearrangement in the test genome compared to the reference genome.
In one embodiment, the method further comprises determining the sequence of the test genome over a region spanning at least one breakpoint of the rearrangement. In another embodiment, the reference genome is a human genome. In another embodiment, the test genome is from a tumor cell. In another embodiment, the reference genome and the test genome are from different species. In another embodiment, the plurality of clones covers substantially all of the test genome.
In another embodiment, the members of at least one pair of corresponding sequences within the reference genome are closer together than expected based on the known size of the plurality of clones, indicating the presence of an insertion in the test genome between the pair of terminal sequences. In another embodiment, the members of at least one pair of corresponding sequences within the reference genome are further apart than expected based on the known size of the plurality of clones, indicating the presence of a deletion in the test genome between the pair of terminal sequences. In another embodiment, the members of at least one pair of corresponding sequences within the reference genome are present on different chromosomes in the reference genome, indicating the presence of a translocation in the test genome between the pair of terminal sequences. In another embodiment, the method further comprises determining the frequency of each of the terminal sequences, wherein a change in the relative frequency of any of the terminal sequences indicates the presence of an amplification or a deletion in the test genome that includes the terminal sequence. In another embodiment, at least one member of at least one pair of terminal sequences in the test genome is present at a greater than expected frequency in the plurality of clones, indicating the presence of an amplification in the test genome that includes the at least one member of the at least one pair of terminal sequences. In another embodiment, at least one member of at least one pair of terminal sequences in the test genome is present at a lower than expected frequency in the plurality of clones, indicating the presence of a deletion in the test genome that includes the at least one member of the at least one pair of terminal sequences.
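The interpretive rules in the foregoing embodiments can be illustrated with a minimal Python sketch. It is offered only as an illustration and not as the implementation of the invention. The names `EndPair`, `classify_pair`, and `frequency_outliers`, the size tolerance, the fixed-width bins, and the fold-change threshold are all hypothetical choices made for this example; the sketch also assumes that each terminal sequence has already been matched to a single position in the reference genome.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EndPair:
    """Reference-genome positions matched to the two termini of one clone."""
    clone_id: str
    chrom_a: str
    pos_a: int
    chrom_b: str
    pos_b: int


def classify_pair(pair, expected_size, tolerance):
    """Compare the reference separation of a clone's ends to the known clone size.

    expected_size and tolerance are in base pairs; tolerance is an assumed
    allowance for normal variation in insert size.
    """
    if pair.chrom_a != pair.chrom_b:
        # Ends map to different reference chromosomes.
        return "translocation"
    span = abs(pair.pos_b - pair.pos_a)
    if span < expected_size - tolerance:
        # Ends are closer in the reference than in the clone: the test genome
        # carries extra sequence between them.
        return "insertion"
    if span > expected_size + tolerance:
        # Ends are farther apart in the reference than in the clone: the test
        # genome is missing sequence between them.
        return "deletion"
    return "concordant"


def frequency_outliers(pairs, bin_size, expected_per_bin, fold=2.0):
    """Flag reference bins whose end-sequence count departs from expectation.

    Only bins hit by at least one end sequence are examined, so bins with no
    hits at all are not reported by this sketch.
    """
    counts = Counter()
    for p in pairs:
        counts[(p.chrom_a, p.pos_a // bin_size)] += 1
        counts[(p.chrom_b, p.pos_b // bin_size)] += 1
    calls = {}
    for key, n in counts.items():
        if n >= expected_per_bin * fold:
            calls[key] = "amplification"
        elif n <= expected_per_bin / fold:
            calls[key] = "deletion"
    return calls


if __name__ == "__main__":
    example = EndPair("clone-1", "chr1", 1_000, "chr1", 51_000)
    # Assumed 150 kb clones with a 20 kb tolerance.
    print(classify_pair(example, expected_size=150_000, tolerance=20_000))
```

In practice the tolerance and the expected count per bin would be derived from the measured insert-size distribution and the redundancy of the library, respectively; the fixed values used here are placeholders.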
In another embodiment, the plurality of clones are BAC clones. In another embodiment, the plurality of clones are PAC clones. In another embodiment, the plurality of clones represents a redundancy of at least about 10 fold of the test genome or the portion of the test genome. In another embodiment, the plurality of clones represents a redundancy of at least about 20 fold of the test genome or the portion of the test genome. In another embodiment, the terminal sequences are present on average from about every 5 kb to about every 500 kb throughout the test genome or the portion of the test genome. In another embodiment, the terminal sequences are present on average every 50 kb or less throughout the test genome or the portion of the test genome. In another embodiment, the terminal sequences are present on average every 10 kb or less throughout the test genome or the portion of the test genome. In another embodiment, the terminal sequences are present on average every 5 kb or less throughout the test genome or the portion of the test genome. In another embodiment, the reference genome is a human genome and the plurality of clones comprises at least about 50,000, 100,000, 200,000, 250,000, or more clones. In another embodiment, the terminal sequences are determined by automated sequencing. In another embodiment, the pairs of terminal sequences from the test genome are compared to the pairs of corresponding sequences within the reference genome using a computer.
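The relationship among the number of clones, the redundancy of the library, and the average spacing of terminal sequences can be illustrated by simple arithmetic. The following Python sketch assumes, for illustration only, a genome of about 3,000,000,000 bp and a mean insert size of 150 kb; neither figure is specified above, and the function name `library_coverage` is hypothetical. Redundancy is taken as total cloned bases divided by genome size, and end spacing as genome size divided by the number of terminal sequences (two per clone).

```python
def library_coverage(n_clones, mean_insert_bp, genome_bp):
    """Return (fold redundancy, average spacing in bp between clone ends)."""
    redundancy = n_clones * mean_insert_bp / genome_bp
    # Each clone contributes two terminal sequences.
    end_spacing_bp = genome_bp / (2 * n_clones)
    return redundancy, end_spacing_bp


if __name__ == "__main__":
    GENOME_BP = 3_000_000_000   # assumed genome size
    MEAN_INSERT_BP = 150_000    # assumed mean insert size
    for n in (50_000, 100_000, 200_000, 250_000):
        redundancy, spacing = library_coverage(n, MEAN_INSERT_BP, GENOME_BP)
        print(f"{n:>7} clones: {redundancy:.1f}-fold, "
              f"one end every {spacing / 1000:.1f} kb")
```

Under these assumptions, 200,000 clones correspond to a redundancy of 10 fold with a terminal sequence about every 7.5 kb on average; other insert sizes or genome sizes would change these figures proportionally.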