This invention relates to the fields of genetic engineering, microbiology, and computer science, and more specifically to a computerized design tool that helps the user, whether a molecular biologist or a clinical diagnostician, to calculate and design extremely accurate oligonucleotide probes for DNA and mRNA hybridization procedures. These probes may then be used to test for the presence of precursors of specific proteins in living tissues. The oligonucleotide probes designed with this invention may be used for medical diagnostic kits, DNA identification, and potentially continuous monitoring of metabolic processes in human beings. The present implementation of this computerized design tool runs under Microsoft® Windows™ v. 3.1 (made by Microsoft Corporation of Redmond, Wash.) on IBM® compatible personal computers (PCs).
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are now described. All publications mentioned hereunder are incorporated herein by reference.
To isolate a specific gene for any particular purpose, a researcher first has to have some idea of what he or she is looking for. To do this, the researcher needs to have a probe, which acts like a molecular hook that can identify and latch onto (i.e., bind to or hybridize with) the desired gene in a crowd of many other genes. A researcher who can obtain an entire strand of mRNA can eventually find the gene from which it was copied, using complementary DNA (cDNA, which is a cloned equivalent to RNA and somewhat equivalent to mRNA) as a probe to search through the great mass of genetic material and locate the desired original gene. cDNA essentially is manufactured or non-naturally occurring DNA from which all of the nonessential DNA has been removed. cDNA allows the researcher to concentrate entirely on the important portions of the gene being examined. The nonessential DNA regions are easy to recognize because when the gene is translated into protein, these regions are not reflected in the protein sequence. These regions are called introns, or intervening regions. mRNA has no introns because they have been "spliced" out of the mRNA before translation. Thus, mRNA and cDNA contain only the essential information from a gene (called the exons). cDNA is the equivalent of mRNA with a complementary sequence in which only the exons are present. cDNA may be produced by reverse transcription of mRNA.
The procedure of using cDNA from known mRNA as a probe to search through genetic material and locate the original gene is called molecular hybridization, and is currently one method of identifying specific genes. However, this method is less than perfect, can be extremely time consuming, and often is not even feasible because the researcher actually has to have an entire strand of cDNA from the desired gene before he or she can attempt to use this cDNA to locate and identify the particular gene. Thus, it is something of a circular problem. If the researcher cannot obtain an entire strand of mRNA or cDNA from the desired gene, then he or she must somehow design a probe from scratch to be used to identify that gene.
Oligonucleotide probes (that is, probes made up of a small number of nucleotides, such as 17 to 100) are increasingly being used to identify specific genes from genomic or cDNA libraries when a partial amino acid sequence is known. (von Heijne 1987, Ref. 15). This is a second method of determining a proper probe. Although the present implementation of this invention does not deal with cases in which the proteins have been sequenced, but rather only the DNA or mRNA, it is possible that this invention or a future implementation of it might be used with protein sequences. Such probes can also be used as primers which, when annealed to mRNAs, can be selectively extended into cDNAs. (von Heijne 1987, Ref. 15).
Because of these situations, the problem that the researcher faces is to discover or design a probe or mixture of probes that maximizes the researcher's chances of successful hybridization while at the same time minimizing the amount of time and money that has to be spent on discovering or designing the probes. (von Heijne 1987, Ref. 15). Researchers in the field have determined that computer analysis can greatly expedite and simplify the search for optimal probe sequences. (von Heijne 1987, Ref. 15). However, all of the search strategies known to the present inventors are time consuming (in both CPU and user time) and may be somewhat inaccurate. As stated in von Heijne, "a true optimization of the probe in terms not only of degeneracy but in terms of length, codon usage, Guanine-Cytosine (GC) avoidance, and expected signal-to-noise ratio (hybridization to target over background) is a fairly complex problem, however, and does not seem to have been automated so far." (von Heijne 1987, Ref. 15). Various search strategies known and used in the field to identify and design probes are outlined in the following sources: Lewis (1986, Ref. 9), Raupach (1984, Ref. 11), Yang et al. (1984, Ref. 16), and Martin and Castro (1984, Ref. 10).
In the simplest version of a protein-related search strategy, the search procedure is limited to finding a set of probes of given lengths with the least possible degeneracy simply by scanning the amino acid sequence and noting the number of alternative codons in the corresponding oligonucleotide as the scan moves along the chain of amino acids. (Lewis 1986, Ref. 9). The researcher can also include codon usage statistics (because more than one codon can translate to the same amino acid), which would attach a probability-of-occurrence value to each probe. (Raupach 1984, Ref. 11).
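The least-degeneracy scan described above can be illustrated with a minimal sketch. The function names `window_degeneracy` and `least_degenerate_windows` are hypothetical and do not come from the cited references; the codon counts are those of the standard genetic code, and codon usage statistics are deliberately left out.

```python
# Number of codons encoding each amino acid in the standard genetic code.
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def window_degeneracy(peptide):
    """Count the distinct oligonucleotides that could encode the peptide."""
    total = 1
    for residue in peptide:
        total *= CODON_COUNT[residue]
    return total


def least_degenerate_windows(protein, n_residues):
    """Scan the protein; return (degeneracy, start, window) tuples, lowest first."""
    hits = []
    for start in range(len(protein) - n_residues + 1):
        window = protein[start:start + n_residues]
        hits.append((window_degeneracy(window), start, window))
    return sorted(hits)
```

A window rich in methionine and tryptophan (one codon each) sorts to the top, whereas windows containing leucine, serine, or arginine (six codons each) sort toward the bottom.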
A more advanced algorithm would allow the researcher to specify the way in which he or she plans to synthesize the probes (for example, by adding monomers or mixtures of monomers). It would also be easy for a researcher to add a rough estimate of the dissociation (or melting) temperature of each probe to a program such as this.
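As one illustration of such a rough estimate, the sketch below uses the widely quoted Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is generally applied only to short oligonucleotides. The choice of this rule is an assumption made for illustration; the text above does not name a particular formula, and the function name `wallace_tm` is hypothetical.

```python
def wallace_tm(probe):
    """Rough melting temperature in degrees Celsius by the Wallace rule.

    Assumption: 2 degrees per A or T and 4 degrees per G or C, a rule of
    thumb intended for short oligonucleotides only.
    """
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    if at + gc != len(probe):
        raise ValueError("probe must contain only A, C, G, and T")
    return 2 * at + 4 * gc
```

Because the rule depends only on base composition, two probes of equal length and equal GC content receive the same estimate regardless of sequence order.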
One way to solve the problem of finding local similarities between two proteins being compared that has been discussed in the relevant literature is to use list-sorting or hashing routines. (von Heijne 1987, Ref. 15). These routines are based on the construction of a list or lookup table of k-letter words or k-tuples (i.e., all possible di- or trinucleotides), and the positions where they appear in the sequences being compared. This method is employed in some of the most extensively used "fast search" programs (see examples identified in von Heijne 1987, Ref. 15).
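A k-tuple lookup table of the kind described above may be sketched as follows. This is a generic illustration under stated assumptions, not the code of any of the cited "fast search" programs, and the names `build_ktuple_table` and `shared_ktuples` are hypothetical.

```python
from collections import defaultdict


def build_ktuple_table(sequence, k):
    """Map every k-letter word in the sequence to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        table[sequence[pos:pos + k]].append(pos)
    return table


def shared_ktuples(seq_a, seq_b, k):
    """Return (pos_a, pos_b, word) for every k-tuple found in both sequences."""
    table_a = build_ktuple_table(seq_a, k)
    matches = []
    for pos_b in range(len(seq_b) - k + 1):
        word = seq_b[pos_b:pos_b + k]
        for pos_a in table_a.get(word, []):
            matches.append((pos_a, pos_b, word))
    return matches
```

Building the table takes one pass over the first sequence, after which each word of the second sequence is looked up directly rather than compared against every position.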
Two general methods of designing probes are common in the field, depending upon whether the researcher is trying to design a common probe or a specific probe. Common probes attempt to find common or consensus sequences among various species and among family genes. The first step in designing such a probe is to find the genes of interest. This may be done by performing a keyword or homology search against the GenBank (a genome database available from IntelliGenetics of Mountain View, Calif.) or a keyword search against MEDLINE (the database currently available from the U.S. National Library of Medicine under the data access system known as Dialog of Dialog Information Service, Inc., Palo Alto, Calif.) or by performing a homology analysis between one of the genes of interest and whole GenBank sequences. The next step is to retrieve all of the relevant genes of interest. In the third step, multiple alignment analysis can be done using a commercially available software package such as DNASIS (from Hitachi Software of Brisbane, Calif.), which is an autoconnect program. In this step, the computer identifies which nucleotides are common among the requested sequences: ##STR1## Alternatively, after homology analyses between two sequences are carried out, data from the multiple homology analyses can be combined. The researcher then has to find the common or consensus region manually: ##STR2##
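The identification of nucleotides common to several sequences can be sketched as below. The sketch assumes the sequences have already been aligned to equal length (with "-" marking gaps) and makes no attempt to reproduce the behavior of DNASIS; the names `common_positions` and `common_regions` are hypothetical.

```python
def common_positions(aligned):
    """Return a mask with the shared base at each common column and '.' elsewhere."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    mask = []
    for column in zip(*aligned):
        if len(set(column)) == 1 and column[0] != "-":
            mask.append(column[0])
        else:
            mask.append(".")
    return "".join(mask)


def common_regions(aligned, min_len):
    """Return (start, subsequence) for each common run of at least min_len bases."""
    mask = common_positions(aligned)
    regions = []
    start = None
    # A trailing "." sentinel closes a run that reaches the end of the mask.
    for i, ch in enumerate(mask + "."):
        if ch != "." and start is None:
            start = i
        elif ch == "." and start is not None:
            if i - start >= min_len:
                regions.append((start, mask[start:i]))
            start = None
    return regions
```

The runs returned by `common_regions` correspond to the consensus regions that the researcher would otherwise have to locate by eye.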
Next, the researcher would input the sequence of the common region into the program and then analyze the secondary structure (i.e., the stacking site and the hairpin structure). After this, the researcher manually would select several candidate probes (from five to ten) which contain the minimal hairpin structure and specific length according to the user's interest. A hairpin is an area in which a probe has "folded back" and one portion of the probe has hybridized with another portion of the same probe. The researcher would then perform a homology analysis between each candidate probe and all sequences in the GenBank to find all possible cross-hybridizable genes. Lastly, the researcher manually would decide which is the best candidate probe by determining which probe is highly homologous among the group of interest, but quite different from other unrelated sequences in the GenBank.
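The hairpin check described above can be approximated by looking for a stretch of the probe whose reverse complement occurs further along the same probe. The sketch below is a simplified illustration: it considers only perfectly paired stems, and the minimum loop of three bases and the function names are assumptions rather than values taken from the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written in A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_hairpin_stem(probe, min_loop=3):
    """Length of the longest perfectly paired stem the probe can form with itself.

    A stem of length s exists when probe[i:i+s] has its reverse complement
    starting at least min_loop bases after the arm ends. A stem of length s
    implies one of length s - 1, so the search stops at the first failure.
    """
    n = len(probe)
    best = 0
    for stem in range(1, n // 2 + 1):
        found = False
        for i in range(n - 2 * stem - min_loop + 1):
            arm = reverse_complement(probe[i:i + stem])
            if probe.find(arm, i + stem + min_loop) != -1:
                found = True
                break
        if not found:
            break
        best = stem
    return best
```

Candidate probes with the smallest stem length would be preferred under the "minimal hairpin structure" criterion.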
The conventional methods for designing common oligonucleotide probes using currently available computer software have at least five problems: (1) they involve time-consuming multiple processes; (2) it is difficult to control a significant variable, the melting temperature Tm of the oligonucleotide probes; (3) the methods do not recognize and differentiate exons and introns (thereby making it possible to design a probe that is identical to unrelated mRNA sequences); (4) the methods may miss short pieces of identical sequences; and (5) it is difficult to recognize multiple pieces of identical sequences in the gene.
The second method of designing probes that is common in the field involves designing specific probes. Specific probes attempt to find unique sequences among various species and among family genes and among published sequences in the GenBank. A specific probe is a probe that hybridizes with only one particular gene, thereby identifying the presence of that gene for the researcher. The procedure involves first finding the genes of interest (by performing a keyword search against the GenBank or against MEDLINE) and then retrieving all of the relevant genes of interest. A manual homology analysis between the gene of interest and whole sequences in the GenBank can be performed to find common and unique regions. ##STR3##
Next, the researcher would input the sequence of the unique region into the program and then analyze the secondary structure. After this, the researcher would manually select several candidate probes which contain the minimal hairpin structure and specific length according to the user's interest. The researcher would then perform a homology analysis between each candidate probe and all sequences in the GenBank to find all possible cross-hybridizable genes. Lastly, the researcher manually would decide which is the best candidate probe by determining which probe does not have identical sequences in unrelated sequences in the GenBank.
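The final selection step, in which candidate probes having identical sequences elsewhere are rejected, can be sketched as a simple exact-match screen. The function name `specific_candidates` is hypothetical, only the given strand of each unrelated sequence is examined, and a small list of strings stands in for the GenBank.

```python
def specific_candidates(target, others, length):
    """Return (start, probe) for target windows absent from every other sequence."""
    candidates = []
    for start in range(len(target) - length + 1):
        probe = target[start:start + length]
        # Keep the probe only if no unrelated sequence contains it exactly.
        if not any(probe in other for other in others):
            candidates.append((start, probe))
    return candidates
```

A screen of this kind reports only whether an identical sequence exists; it does not measure how close the nearest non-identical sequence is.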
All of the conventional methods for designing specific oligonucleotide probes known to the inventors using currently available computer software have at least four problems: (1) they involve time consuming multiple processes; (2) it is difficult to control the melting temperature Tm of the oligonucleotide probes; (3) the methods do not allow for quantification of uniqueness; and (4) there is no guarantee that the method will design the best possible probe.
None of the methods discussed in the literature discloses a system that may be used to design both common probes and extremely specific probes, especially a method that minimizes user and CPU time and is exceptionally accurate.
Programs currently used for rapid database similarity searches use either hashing strategies or statistical strategies. The hashing strategy is now being used for the detection of relatively short regions of similarity, while the statistical strategy is now being used for the detection of weaker and longer similarity regions. The Mismatch Model of this invention can be used for very strong similarity searches with running times faster than current hashing strategies.
The basic technologies behind the Mismatch Model used in this invention are hashing and continuous seed filtration, each general technology being known in the public domain and having been previously applied separately to non-genetic applications. To the best of the inventors' knowledge, these methods, used together, have never been suggested in other studies on optimal probe selection. The inventors' methods have a program performance of tens of seconds (CPU+I/O time) with a 1000 nucleotide query and all mammalian DNA on a SPARCstation, and are even faster on the more common personal computer proposed herein.
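The general idea of combining hashing with seed filtration can be illustrated by the following sketch, which is a textbook pigeonhole filter and is not presented as the inventors' Mismatch Model. If a probe matches a stretch of text with at most k mismatches, then at least one of k+1 non-overlapping seeds cut from the probe must match exactly, so a hashed word index is used to find exact seed hits and only those positions are verified. All function names here are hypothetical.

```python
from collections import defaultdict


def build_index(text, word_len):
    """Hash every word of length word_len in the text to its start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - word_len + 1):
        index[text[pos:pos + word_len]].append(pos)
    return index


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(probe, text, max_mismatches):
    """Return sorted start positions where probe matches text within max_mismatches."""
    seed_len = len(probe) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("probe too short for this many mismatches")
    index = build_index(text, seed_len)
    hits = set()
    for s in range(max_mismatches + 1):
        offset = s * seed_len
        seed = probe[offset:offset + seed_len]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(probe) > len(text):
                continue
            # Verify the full probe only where a seed matched exactly.
            window = text[start:start + len(probe)]
            if hamming(probe, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)
```

The filter is exact for substitutions only: insertions and deletions are not considered, and every reported position has been checked against the full probe.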
The H-Site Model of this invention likewise is unique in that it offers a multitude of information on selected probes and original and distinctive means of visualizing, analyzing and selecting among candidate probes designed with the invention. Candidate probes are analyzed using the H-Site Model for their binding specificity relative to some known set of mRNA or DNA sequences, collected in a database such as the GenBank database. The first step involves selection of candidate probes at some or all the positions along a given target. Next, a melting temperature model is selected, and an accounting is made of how many false hybridizations each candidate probe will produce and what the melting temperature of each will be. Lastly, the results are presented to the researcher along with a unique set of tools for visualizing, analyzing and selecting among the candidate probes.
This invention is both much faster and much more accurate than the methods that are currently in use. It is unique because it is the only method that can find not only the most specific and unique sequence, but also the common sequences. Further, it allows the user to perform many types of analysis on the candidate probes, in addition to comparing those probes in various ways to the target sequences and to each other.
Therefore, it is the object of this invention to provide a practical and user-friendly system that will allow a researcher to design both specific and common oligonucleotide probes, and to do this in less time and with much more accuracy than currently done. For example, the current version of the GenBank contains over ninety (90) million nucleotides. It is thought that the human genome alone consists of three billion base pairs, and scientists have so far managed to decode the base sequence of only about 500 human genes, less than one percent of the total. Currently available searching strategies are limited in how many of the GenBank's sequences can be accessed and successfully searched, and how convenient and feasible such a search would be (in terms of both computer processor and human user time). It is also an object of this invention to allow the user to be able to run the program on more readily available and far less expensive computer hardware (i.e., a PC rather than a mainframe). This invention will remove those limits and allow genetic research to take a giant leap forward.
These and other advantages and objects of this invention will become apparent from the following detailed descriptions, drawings, and appended claims.