1. Field of the Invention
The present invention relates to methods for the analysis of nucleic acids and the identification of genotypes present in biological samples. More specifically, embodiments of the present invention relate to automated methods for genotyping and analyzing the sequences of nucleic acids. Even more specifically, embodiments of the present invention relate to methods for genotyping through visual identification of data and the assessment of assays through visual comparison of data and/or quantification of the quality of the assay.
2. Description of Related Art
The detection of nucleic acids is central to medicine, forensic science, industrial processing, crop and animal breeding, and many other fields. The ability to detect disease conditions (e.g., cancer), infectious organisms (e.g., HIV), genetic lineage, genetic markers, and the like underlies disease diagnosis and prognosis, marker-assisted selection, correct identification of crime scene features, the propagation of industrial organisms, and many other techniques. Determination of the integrity of a nucleic acid of interest can be relevant to the pathology of an infection or cancer. One of the most powerful and basic technologies to detect small quantities of nucleic acids is to replicate some or all of a nucleic acid sequence many times, and then analyze the amplification products. The polymerase chain reaction, or PCR, is perhaps the most well-known of a number of different amplification techniques.
PCR is a powerful technique for amplifying short sections of DNA. With PCR, one can quickly produce millions of copies of DNA starting from a single template DNA molecule. PCR includes a three-phase temperature cycle of denaturation of DNA into single strands, annealing of primers to the denatured strands, and extension of the primers by a thermostable DNA polymerase enzyme. This cycle is repeated until there are enough copies to be detected and analyzed. In principle, each cycle of PCR could double the number of copies. In practice, the multiplication achieved after each cycle is always less than 2. Furthermore, as PCR cycling continues, the buildup of amplified DNA products eventually ceases as the concentrations of required reactants diminish. For general details concerning PCR, see Sambrook and Russell, Molecular Cloning: A Laboratory Manual (3rd Ed.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. (2000); Current Protocols in Molecular Biology, F. M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., (supplemented through 2005); and PCR Protocols: A Guide to Methods and Applications, M. A. Innis et al., eds., Academic Press Inc. San Diego, Calif. (1990).
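The gap between ideal doubling and real per-cycle multiplication can be illustrated with a minimal sketch. It assumes a constant per-cycle efficiency, so each cycle multiplies the copy number by (1 + efficiency). The function name `pcr_copies` and the 90% figure are illustrative assumptions, and the sketch deliberately ignores the late-cycle plateau described above.

```python
def pcr_copies(initial_copies, cycles, efficiency):
    """Copy number after `cycles` cycles, assuming every cycle multiplies
    the count by (1 + efficiency). efficiency = 1.0 is perfect doubling."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# One template molecule, 30 cycles.
ideal = pcr_copies(1, 30, 1.0)      # perfect doubling: 2**30 copies
realistic = pcr_copies(1, 30, 0.9)  # assumed 90% efficiency per cycle
print(f"ideal: {ideal:.3g} copies, at 90% efficiency: {realistic:.3g} copies")
```

Even a modest shortfall in per-cycle efficiency compounds over many cycles, which is one reason the accumulation of product is monitored rather than assumed.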
Real-time PCR refers to a growing set of techniques in which one measures the buildup of amplified DNA products as the reaction progresses, typically once per PCR cycle. Monitoring the accumulation of products over time allows one to determine the efficiency of the reaction, as well as to estimate the initial concentration of DNA template molecules. For general details concerning real-time PCR see Real-Time PCR: An Essential Guide, K. Edwards et al., eds., Horizon Bioscience, Norwich, U.K. (2004).
More recently, a number of high throughput approaches to performing PCR and other amplification reactions have been developed, e.g., involving amplification reactions in microfluidic devices, as well as methods for detecting and analyzing amplified nucleic acids in or on the devices. Thermal cycling of the sample for amplification in microfluidic devices is usually accomplished in one of two methods. In the first method, the sample solution is loaded into the device and the temperature is cycled in time, much like a conventional PCR instrument. In the second method, the sample solution is pumped continuously through spatially varying temperature zones. See, e.g., Lagally et al. (Analytical Chemistry 73:565-570 (2001)), Kopp et al. (Science 280:1046-1048 (1998)), Park et al. (Analytical Chemistry 75:6029-6033 (2003)), Hahn et al. (WO 2005/075683), Enzelberger et al. (U.S. Pat. No. 6,960,437) and Knapp et al. (U.S. Patent Application Publication No. 2005/0042639).
Once there are a sufficient number of copies of the original DNA molecule, the DNA can be characterized. One method of characterizing the DNA is to examine the DNA's dissociation behavior as the DNA transitions from double stranded DNA (dsDNA) to single stranded DNA (ssDNA). The process of causing DNA to transition from dsDNA to ssDNA with increasing temperature is sometimes referred to as a “high-resolution temperature (thermal) melt (HRTm)” process, or simply a “high-resolution melt” process. Alternatively, the transition from ssDNA to dsDNA may be observed through various electrochemical methods, which generate a dynamic current as the potential across the system is changed.
Melting profile analysis is an important technique for analyzing nucleic acids. In some methods, a double stranded nucleic acid is denatured in the presence of a dye that indicates whether the two strands are bound or not. Examples of such indicator dyes include non-specific binding dyes such as SYBR® Green I, whose fluorescence efficiency depends strongly on whether it is bound to double stranded DNA. As the temperature of the mixture is raised, a reduction in fluorescence from the dye indicates that the nucleic acid molecule has melted, i.e., unzipped, partially or completely. Thus, by measuring the dye fluorescence as a function of temperature, information is gained regarding the length of the duplex, the GC content or even the exact sequence. See, e.g., Ririe et al. (Anal Biochem 245:154-160, 1997), Wittwer et al. (Clin Chem 49:853-860, 2003), Liew et al. (Clin Chem 50:1156-1164, 2004), Herrmann et al. (Clin Chem 52:494-503, 2006), Knapp et al. (U.S. Patent Application Publication No. 2002/0197630), Wittwer et al. (U.S. Patent Application Publication No. 2005/0233335), Wittwer et al. (U.S. Patent Application Publication No. 2006/0019253), Sundberg et al. (U.S. Patent Application Publication No. 2007/0026421) and Knight et al. (U.S. Patent Application Publication No. 2007/0231799).
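As a minimal sketch of how a melting temperature can be read from fluorescence-versus-temperature data, the example below takes the peak of the negative derivative -dF/dT, a common convention in melt analysis. The melt curve is synthetic (a logistic curve with an assumed Tm of 82° C. and an assumed transition width), and the function name `melting_temperature` is hypothetical; none of these values come from the cited references.

```python
import math


def melting_temperature(temps, fluorescence):
    """Return the temperature at which -dF/dT is largest, using central
    differences on the interior points of the curve."""
    if len(temps) != len(fluorescence) or len(temps) < 3:
        raise ValueError("need two equal-length sequences of at least 3 points")
    best_temp, best_slope = None, -math.inf
    for i in range(1, len(temps) - 1):
        slope = -(fluorescence[i + 1] - fluorescence[i - 1]) / (temps[i + 1] - temps[i - 1])
        if slope > best_slope:
            best_temp, best_slope = temps[i], slope
    return best_temp


# Synthetic melt curve: fluorescence falls as the duplex melts (assumed shape).
assumed_tm = 82.0
temps = [70.0 + 0.1 * i for i in range(251)]  # 70.0 to 95.0 in 0.1 degree steps
fluor = [1.0 / (1.0 + math.exp((t - assumed_tm) / 0.8)) for t in temps]

tm = melting_temperature(temps, fluor)
print(f"estimated Tm: {tm:.1f} C")
```

A single Tm value discards the rest of the curve, which is why the shape of the whole profile can carry additional discriminating information.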
An alternative method for analyzing a nucleic acid uses voltammetry with electrochemical biosensors to detect nucleic acid hybridization. Electrochemical technology is miniaturizable, accurate, and sensitive with controlled reaction conditions. Both label-free and labeled approaches exist for detecting nucleic acid hybridization. Label-free approaches generally rely on changes to the electrical properties of an interface when bound to a nucleic acid, changes in flexibility between rigid dsDNA and more flexible ssDNA, or electrochemical oxidation of guanine bases. See, e.g., Gooding (Electroanalysis 14:1149-1156, 2002); Gooding et al. (Chem. Commun. 2003:1938-1939, 2003); Mearns et al. (Electroanalysis 18:1971-1981, 2006); Palecek (Electroanalysis 8:7-14, 1996). Labeled approaches for detecting nucleic acid hybridization are more common and well-known than label-free approaches. These approaches generally involve redox active molecules that intercalate between Watson-Crick base pairs of a nucleic acid or bind in the minor or major grooves of the nucleic acid secondary structure, and thus do not interact with single-stranded nucleic acids. Examples of such redox active molecules include Co(Phen)₃³⁺, Co(bpy)₃³⁺, and Methylene Blue. See, e.g., Mikkelsen (Electroanalysis 8:15-19, 1996); Erdem et al. (Anal. Chim. Acta 422:139-149, 2000). In some cases, the redox active molecules bind preferentially to either dsDNA or ssDNA. Another alternative method includes attaching a label group, such as a ferrocene group, to the end of a nucleic acid probe, which is immobilized on an electrode surface. See, e.g., Mearns et al. (Electroanalysis 18:1971-1981, 2006); Anne et al. (J. Am. Chem. Soc. 128:542-547, 2006); Lai et al. (Proc. Natl. Acad. Sci. U.S.A. 103:4017-4021, 2006); Fan et al. (Proc. Natl. Acad. Sci. U.S.A. 100:9134-9147, 2003); Xiao et al. (Proc. Natl. Acad. Sci. U.S.A. 103:16677-16680, 2006).
The single-stranded probe molecule is flexible enough that the ferrocene group may come close enough to the electrode surface to be oxidized or reduced. However, upon hybridization, the rigid double-stranded nucleic acid molecule stands normal to the electrode surface, and the ferrocene group is sufficiently far from the electrode that it will not be oxidized or reduced.
These systems may all be interrogated through cyclic voltammetry. By applying an electric potential that increases or decreases over time across the system, a variable electric current is generated as the label or DNA molecule is oxidized or reduced. Complete hybridization of the target molecule to the probe molecule will generate a characteristic dynamic profile of current generated versus voltage applied. Incomplete hybridization, which would occur if the target molecule contained a mutant genotype, would result in a differing dynamic profile of current generated versus voltage applied. Thus, different nucleic acid sequences may be distinguished from one another through examination of their respective voltammograms.
Some nucleic acid assays require differentiation between potential genotypes within a class of known genotypes. Generally, for thermal melt analysis, researchers will visually inspect a thermal melt profile to determine the melting temperature of the nucleic acid in the sample. However, some nucleic acid assays require identification of a single nucleotide change where the difference in melting temperature (Tm) between the wild type nucleic acid and a mutant nucleic acid is quite small (e.g. less than 0.25° C.). This level of temperature resolution is difficult to achieve in a visual inspection. Furthermore, visual inspection of thermal melt profiles to determine melting temperature ignores significant additional information contained in the profiles, such as the overall shape and distribution of the profile.
In developing assays and instruments for automated genotype classification, it is important to assess and quantify the expected accuracy and misclassification rate for a particular population, as well as for a particular assay. Quantitative and visual feedback to scientists and engineers is important, as it tells them how well an assay as a whole is expected to perform in the field. Components of the assay, or assay parameters, may include the following: (i) assay reagents responsible for the amplification of DNA segments, including the primers responsible for amplification of DNA segments where mutations of interest may lie; (ii) sensing and control instrumentation for an assay; (iii) reaction conditions under which the sensing and control instrumentation operates; (iv) the process by which raw data is collected from such an assay (e.g. the imaging system to monitor fluorescence, temperature control systems, temperature sensing systems); and (v) algorithms or assay software that transform raw data into an identification of a genotype.
A change in any of the assay parameters has the potential of making the system perform better or worse. It is important to quantify and visualize the improvement or detriment to the system performance resulting from the change. The quantitative indicator would be a measure of how well different genotypes are separated, and a visual representation would be a scatter of points representing different genotypes plotted on a page. A single data point is defined by one or more parameters obtained from a DNA sample of a particular genotype. The parameters may be derived from some sort of analysis on raw signals, whether static or dynamic in nature. Parameters from static raw signals may come from a steady-state absorbance value whereby an unknown DNA sample fluoresces when the individual strands bind to their complementary tagged allele.
Though a user can visualize the separation of the genotype clusters described by reduced-dimensional data points, it would be useful to quantify their separation in terms of the expected misclassification rate. One way to quantify the separation of different genotypes is to compute the ratio of the determinant of the between-class scatter matrix to the determinant of the within-class scatter matrix, as described in U.S. patent application Ser. No. 12/759,415. However, at times, these matrices may be singular, yielding a determinant of 0 and making this quantifier impossible to compute. Furthermore, the optimal quantifier for the separation of genotypes is an error statistic for the assay, as the objective of the scientist or engineer responsible for developing new assays is to minimize this error statistic. In particular, it is important to know which of the genotypes are most likely to be misclassified as another genotype. Further, it is important to know an overall expected misclassification rate for the assay.
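The determinant-ratio quantifier and its failure mode can be sketched for two-dimensional data points. The sketch below uses the standard textbook definitions of the within-class and between-class scatter matrices, which are assumed here and may differ in detail from the formulation in the cited application. The function names, the example coordinates, and the singularity threshold of 1e-12 are all illustrative assumptions.

```python
def mean(points):
    """Centroid of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def add_outer(matrix, d, weight=1.0):
    """Accumulate weight * d * d^T into a 2x2 matrix stored as nested lists."""
    for i in range(2):
        for j in range(2):
            matrix[i][j] += weight * d[i] * d[j]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def scatter_ratio(classes):
    """Return det(S_B) / det(S_W) for a dict {label: [(x, y), ...]},
    or None when S_W is singular and the ratio cannot be computed."""
    grand = mean([p for pts in classes.values() for p in pts])
    s_w = [[0.0, 0.0], [0.0, 0.0]]  # within-class scatter
    s_b = [[0.0, 0.0], [0.0, 0.0]]  # between-class scatter
    for pts in classes.values():
        mu = mean(pts)
        for p in pts:
            add_outer(s_w, (p[0] - mu[0], p[1] - mu[1]))
        add_outer(s_b, (mu[0] - grand[0], mu[1] - grand[1]), weight=len(pts))
    det_w = det2(s_w)
    if abs(det_w) < 1e-12:  # assumed threshold for "singular"
        return None
    return det2(s_b) / det_w


def square(cx, cy):
    """Four points on a unit square with lower-left corner (cx, cy)."""
    return [(cx, cy), (cx + 1, cy), (cx, cy + 1), (cx + 1, cy + 1)]


far = {"WT": square(0, 0), "HE": square(10, 0), "HM": square(0, 10)}
near = {"WT": square(0, 0), "HE": square(1, 0), "HM": square(0, 1)}
collinear = {"WT": [(0, 0), (1, 0)], "HE": [(5, 0), (6, 0)], "HM": [(10, 0), (11, 0)]}

print(scatter_ratio(far), scatter_ratio(near), scatter_ratio(collinear))
```

Well-separated clusters give a larger ratio than overlapping ones, while the collinear data set has a singular within-class scatter matrix, so the function returns None instead of a number.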
Genotyping systems that yield dynamic signals may include high resolution thermal melt curves where fluorescence versus temperature or the derivative of fluorescence versus temperature yields a unique signature or profile for each genotype. In this instance, an operator typically identifies a genotype by visually comparing these dynamic signatures as curves, not as data points. In U.S. patent application Ser. No. 12/759,415, each dynamic curve is transformed to a correlation vector defined by a set of parameters (usually 3). A correlation vector with three parameters can be plotted as a point in three-dimensional space; likewise, a correlation vector with two parameters can be plotted in two-dimensional space.
It is much simpler to visually classify and observe the degree of separation of different genotypes using a cluster plot of data points rather than looking at the dynamic melt curves from which they were derived. Furthermore, if a data point is not contained within any clusters of previously identified genotypes, this may indicate the discovery of a new mutation or genotype.
Once these parameters or data points are derived, the DNA sample may be identified either visually by the operator or automatically by a computer classification algorithm (as in U.S. patent application Ser. No. 12/759,415). In the genotyping problem, a DNA sample is classified as one of several possible genotypes. Typically, the genotype is classified as homozygous wildtype (WT), heterozygous (HE), or homozygous mutant (HM), depending on the alleles that make up the DNA. No mutation in either allele results in a homozygous wildtype genotype. A mutation in only one allele results in a heterozygous genotype, and a mutation in both alleles results in a homozygous mutant genotype. Data points from the same genotype form a cluster. In genotyping or any classification problem, it is important for clusters representing different genotypes or classes to be far apart from each other to minimize the likelihood of misclassification. Currently, different high resolution melts are displayed to the user as dynamic curves. Representing each dynamic curve as a data point is useful because, in a classification problem such as this, the position of a point relative to genotype clusters obtained from a training set of known genotypes tells the user which cluster a DNA sample of unknown genotype likely belongs to, along with a level of confidence.
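The idea of placing an unknown sample relative to training clusters can be sketched with a nearest-centroid rule. This is not the classification algorithm of the cited application. It is a deliberately simple stand-in that assumes equal prior probabilities and spherical Gaussian clusters with a shared spread `sigma`, so that the confidence is the posterior probability under that assumed model. All names and coordinates are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def classify(sample, training, sigma=1.0):
    """Return (label, confidence) for the genotype whose training centroid is
    nearest to `sample`. Confidence is the posterior under the assumed model:
    equal priors, spherical Gaussian clusters, shared spread sigma."""
    log_weight = {}
    for label, points in training.items():
        c = centroid(points)
        dist_sq = sum((s - m) ** 2 for s, m in zip(sample, c))
        log_weight[label] = -dist_sq / (2.0 * sigma ** 2)
    top = max(log_weight.values())  # subtract the max for numerical stability
    weight = {k: math.exp(v - top) for k, v in log_weight.items()}
    best = max(weight, key=weight.get)
    return best, weight[best] / sum(weight.values())


# Hypothetical two-parameter data points from samples of known genotype.
training = {
    "WT": [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2)],
    "HE": [(3.0, 0.1), (3.2, -0.1), (2.9, 0.0)],
    "HM": [(0.0, 3.0), (0.1, 3.2), (-0.2, 2.9)],
}

clear_label, clear_conf = classify((0.1, 0.2), training)
ambiguous_label, ambiguous_conf = classify((1.5, 0.05), training)
print(clear_label, round(clear_conf, 3), ambiguous_label, round(ambiguous_conf, 3))
```

A point inside the WT cluster is called with high confidence, whereas a point roughly midway between the WT and HE clusters receives a confidence close to one half, flagging it for review.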
In trying to approximate the misclassification rate of an assay, one may be able to run a limited sample size study of NE experimental DNA samples, make the automated identification of the samples, and derive the percentage of incorrect calls from that process. However, because such a study is limited in size, the misclassification rate calculated in this fashion is necessarily a multiple of 1/NE, and thus a misclassification rate of less than 1/NE cannot be detected. When rapidly going through multiple iterations of assay, instrument or algorithm design, NE cannot be very large in the interest of saving time and resources in the process of system optimization. Therefore, it is difficult to assess the improvement or detriment to system performance that results from a change without running multiple experiments.
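The resolution limit of a small study can be made concrete with a short sketch. It lists the only misclassification rates that NE samples can produce, and it computes how often a study would report zero errors for an assumed true rate. The value NE = 20, the true rate of 1%, and the assumption that samples are miscalled independently are all illustrative.

```python
from fractions import Fraction


def observable_rates(n_e):
    """Every misclassification rate a study of n_e samples can report: k / n_e."""
    return [Fraction(k, n_e) for k in range(n_e + 1)]


def prob_zero_errors(true_rate, n_e):
    """Probability that n_e independently called samples show no miscalls."""
    return (1.0 - true_rate) ** n_e


n_e = 20                                        # assumed study size
smallest_nonzero = observable_rates(n_e)[1]     # 1/20, i.e. 5%
p_miss = prob_zero_errors(0.01, n_e)            # assumed true rate of 1%
print(f"smallest detectable rate: {smallest_nonzero}, "
      f"chance of observing 0 errors at a 1% true rate: {p_miss:.2f}")
```

With 20 samples, the observed rate can only be 0%, 5%, 10%, and so on, and an assay with a true 1% error rate would show no errors at all in roughly 82% of such studies under the independence assumption.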
Accordingly, what are desired are methods and systems for data analysis that are capable of visually representing data obtained from genotyping analytic systems in a way that allows an investigator to analyze the set of all genotypes together, or to identify a genotype based on how closely it maps to profiles representing the same genotype and how separate it is from other genotypes. In one instance, the methods and systems may be for high resolution melt analysis and may utilize data obtained from thermal melt analysis. Further, there is a need for a method and system that allows a user to estimate an error statistic for an assay without running numerous experiments using the assay. Further, there is a need to measure and/or visually represent the improvement in an error statistic of an assay when a parameter (e.g., reaction components) is altered, and to discriminate the resulting thermal melt curves and obtain DNA sequence information from them, especially where these thermal melt curves are differentiated by a small temperature range. Also desired are methods and systems for high resolution melt analysis that more accurately identify thermal melt curves and that facilitate detection of sequence information for DNA that contains one or more peaks or mutations. Also desired are methods and systems that are capable of more accurately identifying a nucleic acid sequence and discriminating between similar sequences while taking into account both features of the profile and its overall shape. Also desired are methods that are capable of rapidly identifying a genotype with minimal intervention and decision-making from the user.