Conventional hybridization studies with genome-derived nucleic acid probes require unlabeled Cot-1 DNA fractions to block cross-hybridization of repetitive sequences contained within these probes in eukaryotic genomes. Blocking is necessary because confounding hybridization from repetitive sequences must be eliminated to achieve the specificity needed to identify, detect, or quantify the unique sequences contained in nucleic acid probes. Repetitive sequences comprise at least 50% of the human genome and contain a diverse set of distinct families (Smit, Curr Opin Genet Dev. 1996, 6(6):743-8). Despite the lack of selection for their function and broad, often variable degrees of orthology, such sequences often display sequence conservation throughout mammalian evolution (Rogan et al. Mol Biol Evol. 1987, 4(4):327-42; Mottez et al. Nucleic Acids Res. 1986, 14(7):3119-36), principally because they have properties of semiautonomous transposable elements that promote frequent amplification during host organism evolution, a process originally termed molecular drive by Dover (Dover, Trends Genet. 2002, 18(11):587-9). It is desirable to remove such sequences in most clinical diagnostic applications: because they are ubiquitous throughout the genome, their presence can interfere with the development of probes for unique regions of the genome that correspond to functional genes, whose structures must be preserved because they are essential for normal development and health.
Repetitive sequences are often interspersed with unique or single copy genes, especially in eukaryotic genomes, and their removal from genomic probes is essential to ensure that diagnostic probes specifically recognize only a single location in the genome. These sequences can be eliminated by laboratory techniques designed to sequester them away from labeled probes containing both single copy and interspersed repetitive sequences (Lichter et al. Hum Genet. 1988, 80(3):224-34; Craig et al. Hum Genet 1997, 100:472-476), by blocking their hybridization, or by deducing the single copy sequences by comparisons of known genomic reference sequences with comprehensive databases of consensus sequences that are representative of established repetitive sequence families and subfamilies (Jurka, Curr Opin Struct Biol. 1998, 8(3):333-7).
Cot-1 DNA is often used to attempt to suppress cross-hybridization of repetitive sequences to probes. The problem with attempting to suppress repeat hybridization with Cot-1 DNA is that it can result in enhanced non-specific hybridization between probes and genomic targets. Specifically, it has been demonstrated that Cot-1 added to target DNA actually enhanced hybridization to genomic probes containing conserved repetitive elements (Newkirk, H. L. et al., Nuc. Acids Res. 2005, 33(22):e191). In addition to repetitive sequences, Cot-1 was also found to be enriched for linked single copy sequences (Newkirk, H. L. et al., Nuc. Acids Res. 2005, 33(22):e191). Adventitious association between these sequences and probes distorts quantitative measurements of the probes hybridized to desired genomic targets. It also reduces the reproducibility of hybridization assays, particularly those performed with sources of genomic DNA, and can affect hybridization to mRNAs that contain repetitive sequences (typically found in the untranslated regions of transcripts). The increased non-specific hybridization that occurs when using Cot-1 to block repeat sequence hybridization has particularly adverse effects on microarray studies, which depend on quantification of signals obtained by hybridization to the unblocked, presumably single copy, sequences.
The need for Cot-1 DNA, whether to sequester repeats or to block their hybridization, was eliminated by direct synthesis of probes lacking repeat sequences. Knoll et al., U.S. Pat. No. 6,828,097 (the '097 patent), discloses a procedure for determining the locations of single copy intervals and designing probes for hybridization to their complementary locations in the human genome. It is disclosed that the procedure can be implemented for any genome for which a comprehensive catalog of repetitive sequences is available. Presumed single copy sequences that in fact contain repetitive elements will cross-hybridize to multiple locations in the genome. Where hybridization occurs in too many genomic locations, the lack of specificity adversely impacts the utility of the probes in diagnosing disease. Therefore, methods by which single copy sequences can be deduced without requiring a comparison of the genomic sequence with a comprehensive database of consensus repetitive sequence family members would represent an improvement over current in silico methods of identifying single copy intervals and the ensuing probes.
Methods have been developed that align the sequences of different, related, or identical complete genomes, and from these alignments the locations of individual repetitive sequences in the genome can be inferred. One such example is the maximal unique matching algorithm, which uses suffix trees to find all maximal length unique matches (MUMs) between sequence strings (Delcher et al. Nuc. Acids Res. 1999, 27:2369-2376). Repeats can be detected in a genome because they are found in overlapping MUMs that are not necessarily contiguous in that genome. Once repeat sequence elements are identified through such comparisons, families of related repeat sequences can be identified through comparisons of individual family members with the genome sequence itself. Another popular method, the BLAT algorithm (Kent et al. Genome Res. 2002, 12:656-64), is a rapid alignment method that uses a hash-index algorithm to quickly find sequences similar to a particular test sequence in a genome; it is not, however, an ab initio approach for single copy sequence identification. Other comparative alignment tools useful for detecting repeat sequences include ASSIRC (Vincens et al. Bioinformatics 1998, 14:715-725), DIALIGN (Morgenstern et al. Bioinformatics 1998, 14(3):290-4), DBA (Jareborg et al. Genome Res. 1999, 9(9):815-24), GLASS (Batzoglou et al. Genome Res. 2000, 10(7):950-8), LSH-ALL-PAIRS (Buhler, Bioinformatics 2001, 17(5):419-28), MEGABLAST (Zhang et al. J Comput Biol. 2000, 7(1-2):203-14), PIPMaker (Schwartz et al. Genome Res. 2000, 10(4):577-86), SSAHA (www.sanger.ac.uk/Software/analysis/SSAHA), and WABA (Kent and Zahler, Genome Res. 2000, 10(8):1115-25).
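The hash-index idea mentioned above can be illustrated with a minimal Python sketch. It is not the BLAT or MUM implementation and omits everything those tools do beyond exact seed lookup (seed extension, mismatch tolerance, suffix trees). The function names, the toy sequence, and the seed length `k = 7` are assumptions chosen only for illustration. The sketch indexes every k-mer of a sequence by position, looks up the k-mers of a test sequence in that index, and reports k-mers that occur at more than one position, which is one simple indication of repeated sequence.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map each k-mer in `sequence` to the sorted list of 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return dict(index)  # plain dict so later lookups never create entries


def find_seed_hits(index, query, k):
    """Return (query_pos, indexed_pos) pairs for every exact k-mer shared
    by `query` and the indexed sequence."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


def repeated_kmers(index):
    """Return only the k-mers that start at more than one position."""
    return {kmer: pos for kmer, pos in index.items() if len(pos) > 1}


# Hypothetical toy sequence containing two copies of "GATTACAG".
genome = "GATTACAGGCCTTGATTACAGAAAC"
k = 7
index = build_kmer_index(genome, k)
repeats = repeated_kmers(index)
hits = find_seed_hits(index, "TTGATTACAG", k)
```

In this toy case the test sequence yields seed hits at two separate locations, which shows why a hash-index lookup finds similar sequences quickly but does not by itself establish which intervals are single copy.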
U.S. application Ser. No. 10/229,058 discloses that sequences can be screened for the presence of known repetitive sequence families (e.g., Alu elements); however, the details of these screening procedures are not disclosed. U.S. application Ser. No. 10/132,002 discloses a procedure for detecting repetitive sequences experimentally, but does not disclose the identification of single copy sequences. U.S. application Ser. No. 10/833,954 discloses that in situ hybridization of a mixture of single copy and repetitive sequences can be performed in the absence of blocking nucleic acids that prevent cross-hybridization of repetitive sequences. A formulation of a hybridization reagent and washing conditions that could mitigate such cross-hybridization are disclosed, but no information is provided regarding the location of single copy and repetitive sequences within the probe segment. U.S. Ser. No. 10/132,993 discloses laboratory chromatographic methods to remove repetitive sequences from genomic DNA to make probes that are substantially complementary to single copy intervals. In this application, neither the locations nor the specific single copy sequences are determined prior to experimentally removing the repeat sequences. A very similar approach is described in U.S. application Ser. No. 10/798,949, in which repetitive sequences are subtracted by hybridization, and single copy sequences are subsequently amplified using so-called unique sequence primers. Subtraction hybridization is not a robust technique, because low- to middle-reiteration frequency repeats are not completely eliminated under the hybridization conditions typically used in these studies. Therefore, the selection of these primers could result in the production of probes that are contaminated with repetitive sequence elements. Similarly, in U.S. application Ser. No. 10/229,058, the repetitive sequences are fractionated by hybridization methods prior to library production and sequencing.
Presumably, the single copy sequences would be revealed after library enrichment; however, U.S. Ser. No. 10/229,058 does not teach how to identify the precise boundaries of these sequences in the genome, nor does it teach how to identify single copy sequences for use as probes. U.S. Ser. No. 10/330,089 is the most recent of several continuation applications which infer the single copy nature of cloned sequences by their lack of hybridization to total genomic DNA, which is highly enriched in repetitive elements. The specific single copy sequences are not revealed by this approach. Furthermore, the present applicants have demonstrated that the single copy sequences produced according to this method are contaminated with repetitive sequences, since the method is particularly insensitive to the detection of low- to moderate-abundance repetitive sequence family members. See U.S. Pat. No. 6,828,097, Prosecution History.
While several of these approaches can find locally similar repetitive sequences without comparison to a library of sequences (a comparison that is required in Knoll et al., U.S. Pat. No. 6,828,097), their objective is to identify repetitive sequences and multiple copies of related sequences found in the genomes of different individuals or species. These approaches do not involve the use of repetitive sequences to infer the presence of single copy sequence intervals (between adjacent repetitive sequences in the genome) for the development of useful single copy probes from the intervening regions between the deduced repetitive sequences. These algorithms therefore produce libraries similar to that used in the '097 patent, and the sequences contained in these libraries will be similar to those already known. These algorithms do not describe inferred single copy intervals or, in particular, the use of probes obtained from those deduced intervals.
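The notion of a single copy interval lying between adjacent repetitive sequences can be illustrated with a short Python sketch. It is not the procedure of the '097 patent or of any application cited above; it assumes only that repeat locations are already available as half-open `(start, end)` coordinates, however they were obtained. The function name, the `min_length` parameter, and the coordinates are hypothetical and chosen for illustration.

```python
def single_copy_intervals(seq_length, repeat_intervals, min_length=1):
    """Return the half-open (start, end) gaps of a sequence of length
    `seq_length` that are not covered by any repeat interval.

    `repeat_intervals` holds half-open (start, end) pairs that may overlap
    and may be unsorted. Gaps shorter than `min_length` are discarded.
    """
    gaps = []
    cursor = 0  # first position not yet known to be covered by a repeat
    for start, end in sorted(repeat_intervals):
        # Clip annotations to the sequence bounds.
        start = max(start, 0)
        end = min(end, seq_length)
        if start - cursor >= min_length and start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing gap after the last repeat, if any.
    if seq_length - cursor >= min_length and seq_length > cursor:
        gaps.append((cursor, seq_length))
    return gaps


# Hypothetical repeat annotations on a 10,000-base region.
repeats = [(4000, 4300), (500, 800), (700, 1200), (9500, 10000)]
all_gaps = single_copy_intervals(10000, repeats)
probe_candidates = single_copy_intervals(10000, repeats, min_length=1000)
```

The `min_length` filter reflects the practical point that only intervals long enough to serve as probes are of interest; the threshold of 1000 bases used here is an arbitrary illustrative value.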