The invention is in the field of bioinformatics and provides methods and systems for generating optimal reagent oligonucleotides for use in biochemical methods, for comparing and evaluating biological sequences, for providing sequences of biological molecules in a relational format allowing retrieval in a client-server environment, and for creating libraries of DNA hybridization probes.
The invention also provides methods for comparing and evaluating biological sequences, methods for providing sequences of biological molecules in a relational format allowing retrieval in a client-server environment, and libraries of DNA hybridization probes.
All populations of organisms exhibit genetic diversity. In any particular population, the extent, kind and structure of genetic diversity are influenced by the biological processes of mutation and recombination, as well as the population genetic processes of natural selection and random genetic drift. The effect of these processes depends on population size, subdivision and history, as well as mating patterns. A newly arisen variant may confer an evolutionary advantage or disadvantage, or it may be neutral. Natural selection may remove a disadvantageous variant from a population, drive a favored variant to fixation, or maintain polymorphism due to balancing effects. Loss, fixation or polymorphism of neutral variants may occur due to chance events. (Hartl, D., and Clark, A., Principles of Population Genetics, 2nd Ed., Sinauer Assocs, Inc., Sunderland, Mass. © 1989).
Hybridization methods to score genetic diversity have not realized their potential. A primary cause for this is that software has been unavailable to comprehensively analyze the nucleic acid sequence context of a targeted variation. Because of this, there are differential success rates across laboratories. Laboratories that happen to have researchers who are either lucky or who develop a touch for a method are able to achieve allelic discrimination in some 70 to 90 out of 100 designed assays based on “brute force” approaches alone. Other labs, with less experienced or less lucky researchers, often have little or no success, failing to get even a single assay to perform well. Assessing millions of genetic polymorphisms in tens, hundreds, and thousands of biological samples represents an enormous task.
In order to more efficiently score genetic polymorphism, a number of molecular biology methods have been developed. One method is “single base extension”, a form of nucleotide sequencing. In this method, an oligonucleotide sequencing primer is extended by just one base, and this base is complementary to the targeted variation.
Additional methods include hybridization methods such as oligonucleotide arrays (for example, PCT Application WO 99/05324), molecular beacons, Invader, the 5′ nuclease method, and DASH (Howell et al., (1999) Nat. Biotech. 17:87-88). The principle underlying these methods is that an oligonucleotide will bind more strongly to a target DNA sequence when there is perfect, complementary Watson-Crick base pairing than when there are one or more mismatches between the oligonucleotide probe and the complementary target sequence. Ideally, probe hybridization should be digital. That is, a probe should always hybridize to its perfectly complementary sequence and never hybridize to a sequence that is not perfectly complementary.
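The discrimination principle described above can be illustrated with a minimal Python sketch. It is not taken from the methods referenced here; the function names and example sequences are hypothetical. It counts the Watson-Crick mismatches between a probe and a target site, and estimates a rough melting temperature using the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), which is only a coarse approximation for short oligonucleotides.

```python
# Illustrative sketch only: names and sequences are assumptions, not from the source.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def count_mismatches(probe, target_site):
    """Count positions where the probe fails to pair with the target site.

    Both sequences are written 5'->3' and must have equal length; the probe
    is expected to bind the target strand in antiparallel orientation.
    """
    if len(probe) != len(target_site):
        raise ValueError("probe and target site must have the same length")
    perfect_probe = reverse_complement(target_site)
    return sum(p != q for p, q in zip(probe.upper(), perfect_probe))


def wallace_tm(oligo):
    """Rough melting temperature in degrees C using the Wallace rule."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


# Two hypothetical alleles differing by a single base (C vs. G).
allele_1 = "GATTACAGGCT"
allele_2 = "GATTAGAGGCT"
probe = reverse_complement(allele_1)  # probe designed against allele 1

mismatches_1 = count_mismatches(probe, allele_1)
mismatches_2 = count_mismatches(probe, allele_2)
```

In this sketch the probe pairs perfectly with the first allele and carries one mismatch against the second. A real design tool would need a thermodynamic model to predict how much that single mismatch destabilizes the duplex; the mismatch count alone does not capture sequence context.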
Despite the recent completion of drafts of the human genome and other genomes, and the identification of millions of genetic polymorphisms, to date only a tiny fraction of genetic diversity has been studied with respect to medically and commercially important traits. The small number of studied polymorphisms is largely due to the large amount of work required by conventional laboratory methods and processes. One aspect of conventional methods that is particularly labor intensive is the design of assays, and most particularly the design of the oligonucleotide primers or probes used therein. Present design methods often result in sub-optimal assays that require extensive laboratory optimization in order to obtain meaningful signals while keeping nonspecific biological background interference, primer dimerization, and oligonucleotide secondary structure formation to a minimum (Saiki, et al. (1985) Science 37:170-172). This is especially true for methods that use hybridization probes to discriminate among genetic variations.
It would be highly useful to apply SNP scoring, and especially hybridization methods, to the study of genetic diversity on a large scale. For example, it would be useful to study the association between certain variations and susceptibility or resistance to specific diseases, or to drug response. Realizing these benefits will require the large-scale design of oligonucleotides to be used as PCR primers, as allele-specific hybridization probes, and for other functions. Further, it will require storing a vast amount of data in a way that eases later querying and retrieval. What is needed are improved processes and methods suitable for large-scale design of genetic diversity assays, and systems and methods for organizing the large amounts of data used in genetic diversity studies.
The first previous approach is PrimerExpress™ Software from Applied Biosystems. This software functions as a calculator in which the user must input each sequence individually, so comprehensive examinations are not performed. This software does not allow specification of the targeted genetic variation, does not automatically examine both the forward and reverse strands of the DNA molecule, does not automatically evaluate primer and probe sequences for more than one model, and does not communicate with a central database. Better software would be process oriented, such that it leads the user through the design process and requires little user intervention. Such software would also operate in batch mode, being able to process a queue of variations.
The second previous approach is MeltCalc software, which is implemented in an Excel spreadsheet. This software functions as a calculator, but also examines some of the surrounding sequence. It appears that this software does not perform a comprehensive examination, does not automatically examine both the sense and antisense strands of the DNA molecule, does not evaluate primers in addition to probes, is specific to one model, does not communicate with a central database, and is not process oriented.
Many molecular biology methods for scoring genetic variation require the use of one or more reagent oligonucleotides. Each of these reagent oligonucleotides performs a separate function, and these functions are well known in the art. These functions include, but are not limited to, forward PCR primer, reverse PCR primer, sequencing primer, allele-specific hybridization probe, anchor probe, invader probe, and reporter probe. Typically, many candidate oligonucleotides can be considered for each function. The problem is to choose, typically, one oligonucleotide for each function such that the oligonucleotides for all functions perform well in combination to produce excellent allelic discrimination. In addition, it is important to design reagent oligonucleotides that are not cross-reactive or inhibitory, for example by minimizing primer dimerization or reagent oligonucleotide cross-complementarity, so that the biochemical method employed to evaluate target nucleic acid sequences is as efficient as possible.
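The selection problem stated above can be sketched as an exhaustive search over one candidate per function. The following Python example is a minimal illustration under stated assumptions, not the method of this disclosure: the scoring criterion (the longest run of bases that can pair between any two chosen oligonucleotides), the function names, and the candidate sequences are all hypothetical. A practical scorer would also weigh melting temperature, self-complementarity, and secondary structure.

```python
# Illustrative sketch only: scoring rule, names, and sequences are assumptions.
from itertools import combinations, product

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def longest_complementary_run(a, b):
    """Length of the longest contiguous stretch of `a` that can base-pair with `b`.

    Computed as the longest common substring of `a` and the reverse
    complement of `b`, using a standard dynamic-programming table.
    """
    a = a.upper()
    rc = reverse_complement(b)
    best = 0
    prev = [0] * (len(rc) + 1)
    for x in a:
        cur = [0] * (len(rc) + 1)
        for j, y in enumerate(rc, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def choose_set(candidates):
    """Pick one oligonucleotide per function, minimizing the worst pairwise run.

    `candidates` maps a function name to a list of candidate sequences.
    Returns (score, assignment); ties keep the first combination found.
    """
    functions = list(candidates)
    best = None
    for combo in product(*(candidates[f] for f in functions)):
        score = max(
            (longest_complementary_run(a, b) for a, b in combinations(combo, 2)),
            default=0,
        )
        if best is None or score < best[0]:
            best = (score, dict(zip(functions, combo)))
    return best


# Hypothetical candidates for two functions.
candidates = {
    "forward_primer": ["AAAAAAAA", "ACACACAC"],
    "reverse_primer": ["TTTTTTTT", "CCCCCCCC"],
}
score, assignment = choose_set(candidates)
```

The exhaustive product grows multiplicatively with the number of candidates per function, so this brute-force form is only practical for small candidate lists; it is shown here because it makes the "perform well in combination" criterion explicit.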
Prior approaches resulted in sub-optimal assays because researchers examined only a few of the candidate reagent oligonucleotides, one at a time. This was slow and laborious, and it resulted in many failed assays and in much laboratory time and cost spent optimizing reaction conditions.