The present invention relates to methods for identifying hybridization targets of polynucleotide probes within polynucleotide databases. In particular, the present invention provides methods for determining the similarity of polynucleotide probes to polynucleotides in genomic databases, using thermodynamic scoring models.
The rational design of new pharmaceutical agents and therapies is increasingly based on the understanding of disease processes on a cellular and molecular level. For example, through understanding of genetic differences between normal and diseased individuals, differences in the biochemical makeup and function of cells and tissues can be determined and appropriate therapeutic interventions identified.
Accordingly, much effort has been dedicated toward mapping of the human genome, which comprises over 3×10⁹ base pairs of DNA (deoxyribonucleic acid). While this exercise has largely been completed, relatively little is known about which of the estimated 30,000 human genes are specifically involved in any given biochemical process. The analysis of gene function will be a major focus of basic and applied pharmaceutical research over the coming years, toward the end of developing new medicines and therapies for treating a wide variety of disorders. However, the complexity of the human genome and the interrelated functions of many genes make the task exceedingly difficult, and require the development of new analytical tools.
A variety of tools and techniques have already been developed to investigate the structure and function of individual genes and the proteins they express. Such tools include polynucleotide probes, which comprise relatively short, defined sequences of nucleic acids, typically labeled with a radioactive or fluorescent moiety to facilitate detection. Probes may be used in a variety of ways to detect the presence of a polynucleotide sequence, to which the probe binds, in a mixture of genetic material. In general, the target sequence can be harbored by a longer nucleic acid molecule, e.g., a DNA restriction fragment, a PCR (polymerase chain reaction) amplicon, an mRNA (messenger ribonucleic acid) transcript, or a reverse-transcribed cDNA (complementary DNA) fragment. The detection of the target sequence usually implies the detection of the larger fragment.
Probes may be used as diagnostics, for detection of a particular genetic sequence in genetic material obtained from a subject. The effect of drugs on specific biologic processes (either with respect to efficacy or unwanted side effects) may also be monitored, by using probes to determine the effect of the drug on genes involved in the processes. Probes may also be used in the process of investigating unknown gene functions, such as in gene expression studies, and in genotyping and antisense assays.
The use of probes to monitor changes in gene expression may give insight into the role of specific genes in a given biological process. The amount of mRNA produced by a given gene is related to the involvement of the gene in a given biological process; genes that display an increase of expression activity during the process are likely to be involved in the process. However, correlating the function of any one gene with a biological process is complicated, since most processes are controlled or affected by a large number of genes. Thus, gene expression studies preferably monitor the expression of multiple genes simultaneously.
In order to simultaneously monitor the expression of a large number of genes, high throughput assays have been developed comprising microarrays of probes. Such microarrays comprise a large number of probes of known composition, bound to a substrate. Isolated tissue mRNA is amplified and reverse transcribed to produce cDNA, which is fluorescently labeled. The cDNA is then hybridized to the array, and the level of fluorescence at each probe is detected. The level of fluorescence is proportional to the amount of cDNA bound to the probe and, consequently, to the amount of mRNA in the tissue of interest. The design and application of assays among those known in the art are disclosed in Duggan, D. J., et al., “Expression Profiling Using cDNA Microarrays.” Nature Genetics Supplement Vol. 21, (1999): 10–14; Roses, D. A. “Pharmacogenetics and the Practice of Medicine.” Nature Vol. 405, (2000): 857–865; Lockhart, D. J., et al., “Expression Monitoring by Hybridization to High-Density Oligonucleotide Arrays.” Nature Biotechnology Vol. 14, (1996): 1675–1680; and Lockhart, D. J., and Winzeler, E. A. “Genomics, Gene Expression and DNA Arrays.” Nature Vol. 405, (2000): 827–836.
The specificity of the probes is essential for the microarray or hybridization-based assays to be meaningful. The utility of a probe to monitor a gene of interest is significantly diminished if it also binds to another gene. This problem is exacerbated when studying large genomes, with commensurately increased possibilities of encountering multiple genes that could bind to a probe that lacks sufficient specificity. Accordingly, a goal of hybridization assay design is to detect only the desired specific target sequence while minimizing interference or cross-hybridization with other polynucleotide sequences present in the polynucleotide mixture being analyzed. Cross-hybridization is typically due to the presence of limited base differences, as well as insertions and deletions within genomic sequences that are similar. The ability to reduce cross-hybridization becomes extremely important when many or all of the sequences present in the complex nucleic acid mixture are previously known, and the number of probes being designed is large (>100).
Insofar as binding of a probe to a polynucleotide target can be characterized according to well-defined rules, probe design can be reduced to a string-matching exercise, which is particularly amenable to computerization. Accordingly, a variety of computerized systems have been developed for analysis of genetic sequences. The use of computers to collect, organize and analyze genetic and protein sequences and associated information is generally known as “bioinformatics.” Various types of computer algorithms are described in the literature, such as Myers' grep algorithm, described in Myers, G. “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.” Journal of the ACM Vol. 46, No. 3, (1999): 395–415. However, these algorithms only return matches that differ from the query sequence by no more than a given number of mismatches. The mismatch sequences provided by fast algorithms are of initial interest, but fail to include alternate binding sites that contain insertions and deletions and that can still cross-hybridize with the selected probe. Thus, approximate string-matching algorithms, although potentially fast, are not sensitive enough to detect all alternative binding sites for a probe.
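The limitation described above can be illustrated with a minimal sketch. It is not Myers' bit-vector algorithm; it is a plain window-by-window mismatch count, and the function name, the test sequences, and the mismatch limit are assumptions chosen for illustration. Sequences are compared in the same sense, as in a database similarity search.

```python
def hamming_hits(probe, target, max_mismatches):
    """Return (start, mismatches) for every window of `target` that differs
    from `probe` by at most `max_mismatches` substitutions."""
    hits = []
    n = len(probe)
    for start in range(len(target) - n + 1):
        window = target[start:start + n]
        mismatches = sum(1 for a, b in zip(probe, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Hypothetical probe and targets, for illustration only.
probe = "CTGCCTGTCCCAATGCTC"
substituted = "GG" + "CTGCCTGTCCCAGTGCTC" + "GG"  # one base substituted
deleted = "GG" + "CTGCCTGTCCCTGCTC" + "GGGG"      # the two bases "AA" deleted

print(hamming_hits(probe, substituted, 2))  # the substituted site is reported
print(hamming_hits(probe, deleted, 2))      # the deletion site is not reported
```

In this sketch the site with a single substitution is found, while the site with a two-base deletion falls out of register after the deletion and exceeds the mismatch limit, even though most of its bases still correspond to the probe.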
Other common search programs compute an alignment value or “score” for every sequence in the database that matches a given query sequence. The given score for a query sequence represents the degree of similarity between the query sequence and the database sequence. This score is generally calculated from the alignment of the two sequences, and is based on a substitution score matrix. A dynamic programming algorithm for computing the optimal local-alignment score was first described in Smith, T. F. and Waterman, M. S. “Identification of Common Molecular Subsequences.” J. Mol. Biol. Vol. 147, (1981): 195–197. This dynamic programming algorithm was later improved to include linear gap-penalty functions. Gotoh, O., J. Mol. Biol. Vol. 162, (1982): 705–708. Gaps are observed when, in a given alignment, some nucleotides of one sequence have no similar nucleotides in the other sequence. The example below shows an alignment with a gap of two and a gap of one.
CTGCCTGTCCCAATGCTC-AGCC   SEQ. ID. NO. 1
|||||||||||  ||||| ||||
CTGCCTGTCCC--TGCTCCAGCC   SEQ. ID. NO. 2
Gap penalty functions are linear functions of the type:
penalty = initiation + b × extension
where the term “initiation” is defined as the penalty for gaps of one, the term “extension” is the penalty for any subsequent gap length increase and “b” is the length of the gap minus one.
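As a worked illustration of the gap penalty function, the following minimal sketch scores the alignment of SEQ. ID. NO. 1 and SEQ. ID. NO. 2 shown above. The function names and the numeric values for match, mismatch, initiation, and extension are assumptions chosen for illustration; they are not taken from any particular search program.

```python
def gap_penalty(length, initiation, extension):
    """Linear gap penalty: initiation + b * extension, where b = length - 1."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return initiation + (length - 1) * extension


def score_alignment(top, bottom, match, mismatch, initiation, extension):
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_run = 0     # length of the gap currently being traversed
    gap_row = None  # which row the current gap is in
    for a, b in zip(top, bottom):
        row = "top" if a == "-" else "bottom" if b == "-" else None
        if row is not None and row == gap_row:
            gap_run += 1  # the current gap grows by one position
            continue
        if gap_run:
            # The previous gap has ended, so charge its penalty once.
            score -= gap_penalty(gap_run, initiation, extension)
        gap_run, gap_row = (1, row) if row else (0, None)
        if row is None:
            score += match if a == b else mismatch
    if gap_run:
        score -= gap_penalty(gap_run, initiation, extension)
    return score


top = "CTGCCTGTCCCAATGCTC-AGCC"
bottom = "CTGCCTGTCCC--TGCTCCAGCC"
# Hypothetical scoring values: match +1, mismatch -1, initiation 2, extension 1.
print(score_alignment(top, bottom, 1, -1, 2, 1))  # 20 matches - 3 - 2 = 15
```

With these assumed values, the gap of two costs 2 + 1 × 1 = 3 and the gap of one costs 2, so the 20 matched positions yield a score of 15.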
The similarity scoring scheme used by presently known algorithms works well when the purpose of the search is to look for homologous (i.e., evolutionarily related) sequences in the databank. However, the scoring scheme does not translate directly to the strength of the probe binding to detected sites. Thus, these algorithms may fail to identify the binding of probes to sequences that are not homologous, yet exhibit strong binding affinities.
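For reference, similarity scoring by local alignment can be sketched as follows. This is a simplified illustration of the Smith-Waterman recurrence and not the implementation of any cited program: it charges a single assumed cost per gapped position rather than separate initiation and extension terms, and the match, mismatch, and gap values are hypothetical.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local-alignment score between strings a and b, computed with
    the Smith-Waterman recurrence and a fixed cost per gapped position."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("GGACGTGG", "TTACGTTT"))  # shared "ACGT" scores 4
```

The score counts matched and mismatched characters only. Nothing in it reflects which bases are paired or how strongly they bind, which is why a similarity score of this kind does not translate directly into binding strength.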
An alternative approach to the current model is to use thermodynamic parameters to score the interaction affinity between a gene probe and potential targets. These approaches evaluate the binding strength of two sequences by summing the interactions between each pair of adjacent base pairs along the sequences. Algorithms and thermodynamic parameters among those known in the art are disclosed in Gray, D. M., and Tinoco, I., Jr. “A New Approach to the Study of Sequence-Dependent Properties of Polynucleotides.” Biopolymers Vol. 9, (1970): 223–244; SantaLucia, J., Jr. “A Unified View of Polymer, Dumbbell, and Oligonucleotide DNA Nearest-Neighbor Thermodynamics.” Proc. Natl. Acad. Sci. USA Vol. 95, (1998): 1460–1465; Allawi, H. T. and SantaLucia, J., Jr. “Thermodynamics and NMR of Internal G•T Mismatches in DNA.” Biochemistry Vol. 36, (1997): 10581–10594; Allawi, H. T. and SantaLucia, J., Jr. “Nearest Neighbor Thermodynamic Parameters for Internal G•A Mismatches in DNA.” Biochemistry Vol. 37, (1998): 2170–2179; Allawi, H. T. and SantaLucia, J., Jr. “Nearest-Neighbor Thermodynamics of Internal A•C Mismatches in DNA: Sequence Dependence and pH Effects.” Biochemistry Vol. 37, (1998): 9435–9444; Allawi, H. T. and SantaLucia, J., Jr. “Thermodynamics of Internal C•T Mismatches in DNA.” Nucleic Acids Research Vol. 26, No. 11, (1998): 2694–2701; Peyret, N., et al., “Nearest-Neighbor Thermodynamics and NMR of DNA Sequences with Internal A•A, C•C, G•G, and T•T Mismatches.” Biochemistry Vol. 38, (1999): 3468–3477; Peyret, N., and SantaLucia, J., Jr. “Prediction of Nucleic Acid Hybridization: Parameters and Algorithms.” Abstract of Dissertation, Wayne State University, Detroit, Mich.; Peterson, J. C., et al., “Sequence Information Signal Processor for Local and Global String Comparisons,” California Institute of Technology, Pasadena, Calif., USA; U.S. Pat. No. 5,632,041 (1997); and Kane, M. D., et al., “Assessment of the Sensitivity and Specificity of Oligonucleotide (50 mer) Microarrays.” Nucleic Acids Research Vol. 28, No. 22, (2000): 4552–4557.
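The summation over adjacent base pairs can be sketched as follows for a perfectly matched duplex. The function names are assumptions, and the energy table contains round placeholder numbers chosen only to make the example runnable; they are NOT measured parameters and should be replaced with published values from the references cited above. Mismatches, insertions, and deletions are not handled in this sketch.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def nearest_neighbor_sum(seq, stack_energy, initiation=0.0):
    """Sum one stacking term per pair of adjacent base pairs in the duplex
    formed by `seq` and its perfect complement."""
    total = initiation
    for i in range(len(seq) - 1):
        step = seq[i:i + 2]
        # A dinucleotide step and its reverse complement describe the same
        # stacked pair of base pairs, so ten table entries cover all sixteen.
        key = step if step in stack_energy else reverse_complement(step)
        total += stack_energy[key]
    return total


# PLACEHOLDER values for illustration only; not experimental data.
ILLUSTRATIVE_STACKS = {
    "AA": -1.0, "AT": -1.0, "TA": -1.0, "CA": -1.5, "GT": -1.5,
    "CT": -1.5, "GA": -1.5, "CG": -2.0, "GC": -2.0, "GG": -2.0,
}

print(nearest_neighbor_sum("ACGT", ILLUSTRATIVE_STACKS))  # AC + CG + GT terms
```

Unlike a substitution score, each term here depends on two neighboring base pairs at once, so the same base composition can give different totals depending on the order of the bases.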
Algorithms among those known in the art evaluate probe/target thermodynamics at every possible point of binding, such as by computationally “walking” the probe along the target, shifting the position of the probe by one nucleotide at each step. Such techniques are computationally demanding and inefficient. Moreover, many of the algorithms are unable to take into account gaps and other computational exceptions. Database searches using such algorithms are also quite slow on ordinary computers. Thus, heuristic alternative programs have been developed, such as “FastA” (Fast Alignments), Pearson, W. R., and Lipman, D. J. “Improved Tools for Biological Sequence Comparison.” Proc. Natl. Acad. Sci. USA Vol. 85, (1988): 2444–2448, and “BLAST” (Basic Local Alignment Search Tool), Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” Nucleic Acids Research Vol. 25, No. 17, (1997): 3389–3402; Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. “Basic Local Alignment Search Tool.” J. Mol. Biol. Vol. 215, (1990): 403–410.
Although these methods improve the speed of the search by a factor of up to 40 compared with the Smith-Waterman algorithm, they do so at the expense of sensitivity. Due to the loss of sensitivity, some significant “hits” that would indicate alternative binding sites for a probe are not detected using the heuristic algorithms with their standard parameters.
Accordingly, there is a need for an efficient computational method for determining the binding sites of a given probe to targets in a genome or other composite of polynucleotides. The method should have sufficient sensitivity to find all binding sites of interest, yet process information quickly. Further, the processing method should be designed to be compatible with conventional computer equipment (e.g., readily available personal computers). Such methods preferably take into consideration binding site strength not only for primary binding sequence targets, but also for alternate sites that include mismatch pairs, insertions, and deletions within the nucleic acid target sequence.