The genetic information of all living organisms (e.g., animals, plants and microorganisms) is encoded in deoxyribonucleic acid (DNA). In humans, the complete genome contains about 100,000 genes located on 24 chromosomes (The Human Genome, T. Strachan, BIOS Scientific Publishers, 1992). Each gene codes for a specific protein, which, after its expression via transcription and translation, fulfills a specific biochemical function within a living cell.
A change or variation in the genetic code can result in a change in the sequence or level of expression of mRNA and potentially in the protein encoded by the mRNA. These changes, known as polymorphisms or mutations, can have significant adverse effects on the biological activity of the mRNA or protein, resulting in disease. Mutations include nucleotide deletions, insertions, substitutions or other alterations (i.e., point mutations).
Many diseases caused by genetic polymorphisms are known and include hemophilia, thalassemia, Duchenne Muscular Dystrophy (DMD), Huntington's Disease (HD), Alzheimer's Disease and Cystic Fibrosis (CF) (Human Genome Mutations, D. N. Cooper and M. Krawczak, BIOS Publishers, 1993). Genetic diseases such as these can result from the addition, substitution, or deletion of a single nucleotide in the deoxyribonucleic acid (DNA) forming the particular gene. In addition to mutated genes, which result in genetic disease, certain birth defects are the result of chromosomal abnormalities such as Trisomy 21 (Down's Syndrome), Trisomy 13 (Patau Syndrome), Trisomy 18 (Edward's Syndrome), Monosomy X (Turner's Syndrome) and other sex chromosome aneuploidies such as Klinefelter's Syndrome (XXY). Further, there is growing evidence that certain DNA sequences can predispose an individual to any of a number of diseases such as diabetes, arteriosclerosis, obesity, various autoimmune diseases and cancer (e.g., colorectal, breast, ovarian, lung).
A change in a single nucleotide between the genomes of individuals of the same species (e.g., human beings) that accounts for heritable variation among the individuals is referred to as a “single nucleotide polymorphism” or “SNP.” Not all SNPs result in disease. The effect of an SNP, depending on its position and frequency of occurrence, can range from harmless to fatal. Certain polymorphisms are thought to predispose some individuals to disease or are related to morbidity levels of certain diseases. Atherosclerosis, obesity, diabetes, autoimmune disorders, and cancer are among the diseases thought to correlate with polymorphisms. In addition to a correlation with disease, polymorphisms are also thought to play a role in a patient's response to therapeutic agents given to treat disease. For example, polymorphisms are believed to play a role in a patient's ability to respond to drugs, radiation therapy, and other forms of treatment.
Identifying polymorphisms can lead to better understanding of particular diseases and potentially more effective therapies for such diseases. Indeed, personalized therapy regimens based on a patient's identified polymorphisms can result in life-saving medical interventions. Novel drugs or compounds that interact with products of specific polymorphisms can be discovered once the polymorphism is identified and isolated. The identification of infectious organisms, including viruses, bacteria, prions, and fungi, can also be achieved based on polymorphisms, and an appropriate therapeutic response can be administered to an infected host.
Since a sequence of about 16 nucleotides is specific on statistical grounds even for a genome the size of the human genome, relatively short nucleic acid sequences can be used to detect normal and defective genes in higher organisms and to detect infectious microorganisms (e.g., bacteria, fungi, protists and yeast) and viruses. DNA sequences can even serve as a fingerprint for detection of different individuals within the same species (see, Thompson, J. S. and M. W. Thompson, eds., Genetics in Medicine, W.B. Saunders Co., Philadelphia, Pa. (1991)).
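The statistical argument for a length of about 16 nucleotides can be checked with a short calculation. The following Python sketch is illustrative only: the genome size of roughly 3 x 10^9 base pairs is an assumption not stated in the text, and the function names are hypothetical. It compares the number of distinct sequences of a given length with the number of positions in the genome.

```python
# Assumption (not from the text): a haploid human genome of roughly 3e9 base pairs.
GENOME_SIZE_BP = 3_000_000_000


def distinct_sequences(length: int) -> int:
    """Number of distinct DNA sequences of the given length (4 choices per position)."""
    return 4 ** length


def min_unique_length(genome_size: int) -> int:
    """Smallest length n for which 4**n exceeds the number of positions in the genome."""
    n = 1
    while distinct_sequences(n) <= genome_size:
        n += 1
    return n


def expected_random_matches(length: int, genome_size: int) -> float:
    """Expected number of chance occurrences of one fixed sequence in a random genome."""
    return genome_size / distinct_sequences(length)


if __name__ == "__main__":
    print(distinct_sequences(16))             # 4294967296
    print(min_unique_length(GENOME_SIZE_BP))  # 16
    print(expected_random_matches(16, GENOME_SIZE_BP))
```

Under this simplified model, which treats the genome as a random sequence and ignores repeats and the complementary strand, 16 is the shortest length for which a given sequence is expected to occur by chance less than once.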
Several methods for detecting DNA are used. For example, nucleic acid sequences are identified by comparing the mobility of an amplified nucleic acid molecule with a known standard by gel electrophoresis, or by hybridization with a probe, which is complementary to the sequence to be identified. Identification, however, can only be accomplished if the nucleic acid molecule is labeled with a sensitive reporter function (e.g., radioactive (32P, 35S), fluorescent or chemiluminescent). Radioactive labels can be hazardous and the signals they produce decay over time. Non-isotopic labels (e.g., fluorescent) suffer from a lack of sensitivity and fading of the signal when high intensity lasers are being used. Additionally, performing labeling, electrophoresis and subsequent detection are laborious, time-consuming and error-prone procedures. Electrophoresis is particularly error-prone, since the size or the molecular weight of the nucleic acid cannot be directly correlated to the mobility in the gel matrix. It is known that sequence specific effects, secondary structure and interactions with the gel matrix cause artefacts. Moreover, the molecular weight information obtained by gel electrophoresis is a result of indirect measurement of a related parameter, such as mobility in the gel matrix.
Applications of mass spectrometry in the biosciences have been reported (see Meth. Enzymol., Vol. 193, Mass Spectrometry (McCloskey, ed.; Academic Press, NY 1990); McLafferty et al., Acc. Chem. Res. 27:297-386 (1994); Chait and Kent, Science 257:1885-1894 (1992); Siuzdak, Proc. Natl. Acad. Sci., USA 91:11290-11297 (1994)), including methods for mass spectrometric analysis of biopolymers (see Hillenkamp et al. (1991) Anal. Chem. 63:1193A-1202A) and for producing and analyzing biopolymer ladders (see, International Publ. WO 96/36732; U.S. Pat. No. 5,792,664).
Matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) requires incorporation of the macromolecule to be analyzed in a matrix, and has been performed on polypeptides and on nucleic acids mixed in a solid (i.e., crystalline) matrix. In these methods, a laser is used to strike the biopolymer/matrix mixture, which is crystallized on a probe tip, thereby effecting desorption and ionization of the biopolymer. In addition, MALDI-MS has been performed on polypeptides using the water of hydration (i.e., ice) or glycerol as a matrix. When the water of hydration was used as a matrix, it was necessary to first lyophilize or air dry the protein prior to performing MALDI-MS (Berkenkamp et al. (1996) Proc. Natl. Acad. Sci. USA 93:7003-7007). The upper mass limit for this method was reported to be 30 kDa with limited sensitivity (i.e., at least 10 pmol of protein was required).
MALDI-TOF mass spectrometry has been employed in conjunction with conventional Sanger sequencing or similar primer-extension based methods to obtain sequence information, including the detection of SNPs (see, e.g., U.S. Pat. Nos. 5,547,835; 6,194,144; 6,225,450; 5,691,141 and 6,238,871; H. Köster et al., Nature Biotechnol., 14:1123-1128, 1996; WO 96/29431; WO 98/20166; WO 98/12355; U.S. Pat. No. 5,869,242; WO 97/33000; WO 98/54571; A. Braun et al., Genomics, 46:18, 1997; D. P. Little et al., Nat. Med., 3:1413, 1997; L. Haff et al., Genome Res., 7:378, 1997; P. Ross et al., Nat. Biotechnol., 16:1347, 1998; K. Tang et al., Proc. Natl. Acad. Sci. USA, 96:10016, 1999). Each of the four naturally occurring nucleotide bases in DNA, dC, dT, dA and dG, also referred to herein as C, T, A and G, has a different molecular weight: MC=289.2; MT=304.2; MA=313.2; MG=329.2; where MC, MT, MA and MG are the average molecular weights (under the natural isotopic distribution) in daltons of the nucleotide bases deoxycytidine, thymidine, deoxyadenosine, and deoxyguanosine, respectively. It is therefore possible to read an entire sequence in a single mass spectrum. If a single spectrum is used to analyze the products of a conventional Sanger sequencing reaction, where chain termination is achieved at every base position by the incorporation of dideoxynucleotides, a base sequence can be determined by calculation of the mass differences between adjacent peaks. For the detection of SNPs, alleles or other sequence variations (e.g., insertions, deletions), variant-specific primer extension is carried out immediately adjacent to the polymorphic SNP or sequence variation site in the target nucleic acid molecule. The mass of the extension product and the difference in mass between the extended and unextended product are indicative of the type of allele, SNP or other sequence variation.
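The reading of a base sequence from the mass differences between adjacent peaks can be illustrated with a brief Python sketch. It uses the four average nucleotide masses given above; the function names, the 2.0 Da matching tolerance, and the idealized ladder (no adducts, no noise, every terminated fragment resolved) are assumptions made purely for illustration and do not describe any particular instrument or published method.

```python
# Average nucleotide masses in daltons, as given in the text.
BASE_MASS = {"C": 289.2, "T": 304.2, "A": 313.2, "G": 329.2}


def call_base(delta, tolerance=2.0):
    """Return the base whose mass is closest to delta, or None if none is within tolerance.

    The tolerance (in daltons) is a hypothetical value chosen for illustration.
    """
    base, mass = min(BASE_MASS.items(), key=lambda item: abs(item[1] - delta))
    return base if abs(mass - delta) <= tolerance else None


def read_ladder(peak_masses, tolerance=2.0):
    """Infer a base sequence from the mass differences between adjacent ladder peaks."""
    peaks = sorted(peak_masses)
    calls = []
    for lighter, heavier in zip(peaks, peaks[1:]):
        base = call_base(heavier - lighter, tolerance)
        calls.append(base if base is not None else "N")  # "N" marks an uncalled step
    return "".join(calls)


def simulate_ladder(start_mass, sequence):
    """Build an idealized ladder: each peak is the previous peak plus one nucleotide."""
    masses = [start_mass]
    for base in sequence:
        masses.append(masses[-1] + BASE_MASS[base])
    return masses


if __name__ == "__main__":
    ladder = simulate_ladder(5000.0, "GATTACA")  # 5000.0 Da is an arbitrary primer mass
    print(read_ladder(ladder))
```

The smallest gap between two nucleotide masses is 9 Da (T versus A), so in this idealized setting any tolerance well below 4.5 Da assigns each step unambiguously; with real spectra, the achievable mass accuracy at the fragment sizes involved determines whether that holds.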
U.S. Pat. No. 5,622,824 describes methods for DNA sequencing based on mass spectrometric detection. To achieve this, the DNA is unilaterally degraded in a stepwise manner via exonuclease digestion, the direction of degradation being controlled by means of protection, specificity of enzymatic activity, or immobilization, and the released nucleotides or derivatives are detected by mass spectrometry. Prior to the enzymatic degradation, sets of ordered deletions that span a cloned DNA sequence can be created. In this manner, mass-modified nucleotides can be incorporated using a combination of exonuclease and DNA/RNA polymerase. This permits either multiplex mass spectrometric detection, or modulation of the activity of the exonuclease so as to synchronize the degradative process.
U.S. Pat. Nos. 5,605,798 and 5,547,835 provide methods for detecting a particular nucleic acid sequence in a biological sample. Depending on the sequence to be detected, the processes can be used, for example, in methods of diagnosis.
Technologies have been developed to apply MALDI-TOF mass spectrometry to the analysis of genetic variations such as microsatellites, insertion and/or deletion mutations and single nucleotide polymorphisms (SNPs) on an industrial scale. These technologies can be applied to large numbers of either individual samples, or pooled samples to study allelic frequencies or the frequency of SNPs in populations of individuals, or in heterogeneous tumor samples. The analyses can be performed on chip-based formats in which the target nucleic acids or primers are linked to a solid support, such as a silicon or silicon-coated substrate, preferably in the form of an array (see, e.g., K. Tang et al., Proc. Natl. Acad. Sci. USA, 96:10016, 1999). Generally, when analyses are performed using mass spectrometry, particularly MALDI, small nanoliter volumes of sample are loaded onto a substrate such that the resulting spot is about, or smaller than, the size of the laser spot. It has been found that when this is achieved, the results from the mass spectrometric analysis are quantitative. The area under the signals in the resulting mass spectra is proportional to concentration (when normalized and corrected for background). Methods for preparing and using such chips are described in U.S. Pat. No. 6,024,925, co-pending U.S. application Ser. Nos. 08/786,988, 09/364,774, 09/371,150 and 09/297,575; see, also, International PCT application No. PCT/US97/20195, which published as WO 98/20020. Chips and kits for performing these analyses are commercially available from SEQUENOM, INC. under the trademark MassARRAY™. MassARRAY™ relies on mass spectral analysis combined with the miniaturized array and MALDI-TOF (Matrix-Assisted Laser Desorption Ionization-Time of Flight) mass spectrometry to deliver results rapidly. It accurately distinguishes single base changes in the size of DNA fragments associated with genetic variants without tags.
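Because the background-corrected, normalized peak area is stated above to be proportional to concentration, relative allele frequencies in a pooled sample can in principle be estimated from the areas of the allele-specific signals. The Python sketch below is a simplified illustration only: the constant baseline, the trapezoidal integration window, and all function and parameter names are assumptions and do not represent the processing used by any commercial system.

```python
def peak_area(mz, intensity, lo, hi, baseline=0.0):
    """Trapezoidal area of the baseline-subtracted signal for lo <= m/z <= hi.

    A constant baseline is a simplifying assumption; negative values are clipped to zero.
    """
    points = [(x, max(y - baseline, 0.0)) for x, y in zip(mz, intensity) if lo <= x <= hi]
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(points, points[1:]))


def allele_frequencies(areas):
    """Normalize allele-specific peak areas so that they sum to 1."""
    total = sum(areas.values())
    if total <= 0:
        raise ValueError("no signal above background")
    return {allele: area / total for allele, area in areas.items()}


if __name__ == "__main__":
    # Toy spectrum with two triangular peaks on a flat background of 1.0 (made-up values).
    mz = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    intensity = [1.0, 1.0, 7.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0]
    areas = {
        "allele_1": peak_area(mz, intensity, 0.0, 4.0, baseline=1.0),
        "allele_2": peak_area(mz, intensity, 4.0, 8.0, baseline=1.0),
    }
    print(allele_frequencies(areas))
```

In practice, the proportionality would need to be verified against standards of known composition, since ionization efficiency can differ between extension products.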
Although the use of MALDI for obtaining nucleic acid sequence information, especially from DNA fragments as described above, offers the advantages of high throughput due to high-speed signal acquisition and automated analysis off solid surfaces, there are limitations in its application. When the SNP, mutation or other sequence variation is unknown, the variant mass spectrum or other indicator of mass, such as mobility in the case of gel electrophoresis, must be simulated for every possible sequence change of a reference sequence that does not contain the sequence variation. Each simulated variant spectrum corresponding to a particular sequence variation or set of sequence variations must then be matched against the actual variant mass spectrum to determine the most likely sequence change or changes that resulted in the variant spectrum. Such a purely simulation-based approach is time-consuming. For example, given a reference sequence of 1000 bases, there exist approximately 9000 potential single base sequence variations. For every such potential sequence variation, one would have to simulate the expected spectra and to match them against the experimentally measured spectra. The problem is further compounded when multiple base variations or multiple sequence variations rather than only single base or sequence variations are present.
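The size of the search space for single base variations can be made concrete by enumerating the candidates for a reference sequence. The Python sketch below is illustrative only and uses one possible counting convention, which is an assumption: three substitutions and one deletion per position, plus four insertions at each of the n+1 gaps, for 8n+4 candidates in total. Other conventions give somewhat different totals of the same order of magnitude as the approximate figure quoted above. All function names are hypothetical.

```python
BASES = "ACGT"


def single_base_variants(reference):
    """Yield (kind, position, sequence) for every single-base edit of the reference.

    Counting convention (an assumption): 3 substitutions and 1 deletion per position,
    plus 4 insertions at each of the len(reference) + 1 gaps. Edits that happen to
    produce the same sequence (e.g., within homopolymer runs) are not merged.
    """
    n = len(reference)
    for i in range(n):
        for base in BASES:
            if base != reference[i]:
                yield ("substitution", i, reference[:i] + base + reference[i + 1:])
        yield ("deletion", i, reference[:i] + reference[i + 1:])
    for i in range(n + 1):
        for base in BASES:
            yield ("insertion", i, reference[:i] + base + reference[i:])


def count_single_base_variants(n):
    """Closed-form count under the same convention: 3n + n + 4(n + 1) = 8n + 4."""
    return 3 * n + n + 4 * (n + 1)


if __name__ == "__main__":
    reference = "ACGT" * 3  # toy 12-base reference
    print(sum(1 for _ in single_base_variants(reference)))  # 100
    print(count_single_base_variants(1000))                 # 8004
```

Each candidate would require its own simulated spectrum and a comparison against the measured spectrum, and the number of candidates grows combinatorially once two or more simultaneous variations are allowed.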
Therefore, there is a need to improve the accuracy of SNP, mutation and other sequence variation detection and discovery. Thus, among the objects herein is an object to improve the accuracy of SNP, mutation and other sequence variation detection and discovery. Also among the objects herein is an object to increase the speed of SNP, mutation and other sequence variation detection and discovery.