1. Field of the Invention
The invention relates generally to the field of acquiring, manipulating and using genetic data for medically predictive purposes, and specifically to a system in which imperfectly measured genetic data is made more precise by using known genetic data of genetically related individuals, thereby allowing more effective identification of genetic irregularities that could result in various phenotypic outcomes.
2. Description of the Related Art
Current methods of prenatal diagnosis can alert physicians and parents to abnormalities in growing fetuses. Without prenatal diagnosis, one in 50 babies is born with a serious physical or mental handicap, and as many as one in 30 will have some form of congenital malformation. Unfortunately, standard methods require invasive testing and carry a roughly 1 percent risk of miscarriage. These methods include amniocentesis, chorionic villus biopsy and fetal blood sampling. Of these, amniocentesis is the most common procedure; in 2003, it was performed in approximately 3% of all pregnancies, though its frequency of use has been decreasing over the past decade and a half. A major drawback of prenatal diagnosis is that, given the limited courses of action once an abnormality has been detected, it is only valuable and ethical to test for very serious defects. As a result, prenatal diagnosis is typically only attempted in cases of high-risk pregnancies, where the elevated chance of a defect combined with the seriousness of the potential abnormality outweighs the risks. A need exists for a method of prenatal diagnosis that mitigates these risks.
It has recently been discovered that cell-free fetal DNA and intact fetal cells can enter maternal blood circulation. Consequently, analysis of these cells can allow early Non-Invasive Prenatal Genetic Diagnosis (NIPGD). A key challenge in using NIPGD is the task of identifying and extracting fetal cells or nucleic acids from the mother's blood. The fetal cell concentration in maternal blood depends on the stage of pregnancy and the condition of the fetus, but estimates range from one to forty fetal cells in every milliliter of maternal blood, or less than one fetal cell per 100,000 maternal nucleated cells. Current techniques are able to isolate small quantities of fetal cells from the mother's blood, although it is very difficult to enrich the fetal cells to purity in any quantity. The most effective technique in this context involves the use of monoclonal antibodies, but other techniques used to isolate fetal cells include density centrifugation, selective lysis of adult erythrocytes, and FACS. Fetal DNA isolation has been demonstrated using PCR amplification using primers with fetal-specific DNA sequences. Since only tens of molecules of each embryonic SNP are available through these techniques, the genotyping of the fetal tissue with high fidelity is not currently possible.
Much research has been done towards the use of pre-implantation genetic diagnosis (PGD) as an alternative to classical prenatal diagnosis of inherited disease. Most PGD today focuses on high-level chromosomal abnormalities such as aneuploidy and balanced translocations, with the primary outcomes being successful implantation and a take-home baby. A need exists for a method for more extensive genotyping of embryos at the pre-implantation stage. The number of known disease-associated genetic alleles is currently at 389 according to OMIM and steadily climbing. Consequently, it is becoming increasingly relevant to analyze multiple embryonic SNPs that are associated with disease phenotypes. A clear advantage of pre-implantation genetic diagnosis over prenatal diagnosis is that it avoids some of the ethical issues regarding possible choices of action once undesirable phenotypes have been detected.
Many techniques exist for isolating single cells. The FACS machine has a variety of applications; one important application is to discriminate between cells based on size, shape and overall DNA content. The FACS machine can be set to sort single cells into any desired container. Many different groups have used single cell DNA analysis for a number of applications, including prenatal genetic diagnosis, recombination studies, and analysis of chromosomal imbalances. Single-sperm genotyping has been used previously for forensic analysis of sperm samples (to decrease problems arising from mixed samples) and for single-cell recombination studies.
Isolation of single cells from human embryos, while highly technical, is now routine in in vitro fertilization clinics. To date, the vast majority of prenatal diagnoses have used fluorescent in situ hybridization (FISH), which can determine large chromosomal aberrations (such as Down syndrome, or trisomy 21) and PCR/electrophoresis, which can determine a handful of SNPs or other allele calls. Both polar bodies and blastomeres have been isolated with success. It is critical to isolate single blastomeres without compromising embryonic integrity. The most common technique is to remove single blastomeres from day 3 embryos (6 or 8 cell stage). Embryos are transferred to a special cell culture medium (standard culture medium lacking calcium and magnesium), and a hole is introduced into the zona pellucida using an acidic solution, laser, or mechanical drilling. The technician then uses a biopsy pipette to remove a single visible nucleus. Clinical studies have demonstrated that this process does not decrease implantation success, since at this stage embryonic cells are undifferentiated.
There are three major methods available for whole genome amplification (WGA): ligation-mediated PCR (LM-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), and multiple displacement amplification (MDA). In LM-PCR, short DNA sequences called adapters are ligated to blunt ends of DNA. These adapters contain universal amplification sequences, which are used to amplify the DNA by PCR. In DOP-PCR, random primers that also contain universal amplification sequences are used in a first round of annealing and PCR. Then, a second round of PCR is used to amplify the sequences further with the universal primer sequences. Finally, MDA uses the phi-29 polymerase, which is a highly processive and non-specific enzyme that replicates DNA and has been used for single-cell analysis. Of the three methods, DOP-PCR reliably produces large quantities of DNA from small quantities of DNA, including single copies of chromosomes. On the other hand, MDA is the fastest method, producing hundred-fold amplification of DNA in a few hours. The major limitations to amplifying material from a single cell are (1) the necessity of using extremely dilute DNA concentrations or an extremely small volume of reaction mixture, and (2) the difficulty of reliably dissociating DNA from proteins across the whole genome. Regardless, single-cell whole genome amplification has been used successfully for a variety of applications for a number of years.
There are numerous difficulties in using DNA amplification in these contexts. Amplification of single-cell DNA (or DNA from a small number of cells, or from smaller amounts of DNA) by PCR can fail completely, as reported in 5-10% of the cases. This is often due to contamination of the DNA, the loss of the cell or its DNA, or poor accessibility of the DNA during the PCR reaction. Other sources of error that may arise in measuring the embryonic DNA by amplification and microarray analysis include replication errors introduced by the DNA polymerase, where a particular nucleotide is incorrectly copied during PCR, and microarray reading errors due to imperfect hybridization on the array. The biggest problem, however, remains allele drop-out (ADO), defined as the failure to amplify one of the two alleles in a heterozygous cell. ADO can affect more than 40% of amplifications and has already caused PGD misdiagnoses. ADO becomes a health issue especially in the case of a dominant disease, where the failure to amplify can lead to implantation of an affected embryo. The need for more than one set of primers per marker (in heterozygotes) complicates the PCR process. Therefore, more reliable PCR assays are being developed based on understanding the origin of ADO. Reaction conditions for single-cell amplifications are under study. The amplicon size, the amount of DNA degradation, freezing and thawing, and the PCR program and conditions can each influence the rate of ADO.
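The effect of ADO on a heterozygous cell can be illustrated with a small probability sketch. It assumes, purely for illustration, that each of the two alleles fails to amplify independently with the same probability; the function name, the outcome labels and the 20% rate are hypothetical and are not taken from the studies discussed above.

```python
def het_call_probabilities(per_allele_dropout):
    """Outcome probabilities for one heterozygous locus.

    Assumption (illustrative only): each allele drops out independently
    with probability `per_allele_dropout`.
    """
    d = per_allele_dropout
    if not 0.0 <= d <= 1.0:
        raise ValueError("dropout probability must lie in [0, 1]")
    return {
        # Both alleles amplify: the locus is read correctly.
        "heterozygous": (1.0 - d) ** 2,
        # Only the disease allele drops out: the cell looks unaffected.
        "appears_homozygous_normal": d * (1.0 - d),
        # Only the normal allele drops out.
        "appears_homozygous_mutant": d * (1.0 - d),
        # Both alleles drop out: no genotype can be called.
        "no_call": d * d,
    }


if __name__ == "__main__":
    for outcome, p in het_call_probabilities(0.20).items():
        print(f"{outcome}: {p:.3f}")
```

Under this simplified model, the "appears_homozygous_normal" outcome is the one that matters for a dominant disease, because an affected embryo would be read as unaffected.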
All those techniques, however, depend on the minute amount of DNA available for amplification in the single cell. This process is often accompanied by contamination. Proper sterile conditions and microsatellite sizing can exclude the chance of contaminant DNA, since a microsatellite analysis that detects only parental alleles excludes contamination. Studies to reliably transfer molecular diagnostic protocols to the single-cell level have recently been pursued using first-round multiplex PCR of microsatellite markers, followed by real-time PCR and microsatellite sizing to exclude chance contamination. Multiplex PCR allows for the amplification of multiple fragments in a single reaction, a crucial requirement in single-cell DNA analysis. Although conventional PCR was the first method used in PGD, fluorescence in situ hybridization (FISH) is now common. It is a delicate visual assay that allows the detection of nucleic acid within undisturbed cellular and tissue architecture. It relies firstly on the fixation of the cells to be analyzed. Consequently, optimization of the fixation and storage conditions of the sample is needed, especially for single-cell suspensions.
Advanced technologies that enable the diagnosis of a number of diseases at the single-cell level include interphase chromosome conversion, comparative genomic hybridization (CGH), fluorescent PCR, and whole genome amplification. The reliability of the data generated by all of these techniques relies on the quality of the DNA preparation. PGD is also costly; consequently, there is a need for less expensive approaches, such as mini-sequencing. Unlike most mutation-detection techniques, mini-sequencing permits analysis of very small DNA fragments with a low ADO rate. Better methods for the preparation of single-cell DNA for amplification and PGD are therefore needed and are under study. The more novel microarray and comparative genomic hybridization techniques still ultimately rely on the quality of the DNA under analysis.
Several techniques are in development to measure multiple SNPs on the DNA of a small number of cells, a single cell (for example, a blastomere), a small number of chromosomes, or fragments of DNA. There are techniques that use Polymerase Chain Reaction (PCR), followed by microarray genotyping analysis. Some PCR-based techniques include whole genome amplification (WGA) techniques such as multiple displacement amplification (MDA), and MOLECULAR INVERSION PROBES (MIPs) that perform genotyping using multiple tagged oligonucleotides that may then be amplified using PCR with a single pair of primers. An example of a non-PCR based technique is fluorescence in situ hybridization (FISH). It is apparent that the techniques will be severely error-prone due to the limited amount of genetic material, which will exacerbate the impact of effects such as allele drop-outs, imperfect hybridization, and contamination.
Many techniques exist which provide genotyping data. TAQMAN is a unique genotyping technology produced and distributed by Applied Biosystems. TAQMAN uses polymerase chain reaction (PCR) to amplify sequences of interest. During PCR cycling, an allele-specific minor groove binder (MGB) probe hybridizes to amplified sequences. Strand synthesis by the polymerase enzymes releases reporter dyes linked to the MGB probes, and then the TAQMAN optical readers detect the dyes. In this manner, TAQMAN achieves quantitative allelic discrimination. Compared with array-based genotyping technologies, TAQMAN is quite expensive per reaction (~$0.40/reaction), and throughput is relatively low (384 genotypes per run). While only 1 ng of DNA per reaction is necessary, thousands of genotypes by TAQMAN require microgram quantities of DNA, so TAQMAN does not necessarily use less DNA than microarrays. However, with respect to the IVF genotyping workflow, TAQMAN is the most readily applicable technology. This is due to the high reliability of the assays and, most importantly, the speed and ease of the assay (~3 hours per run and minimal molecular biological steps). Also, unlike many array technologies (such as 500K AFFYMETRIX arrays), TAQMAN is highly customizable, which is important for the IVF market. Further, TAQMAN is highly quantitative, so aneuploidies could be detected with this technology alone.
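The per-reaction figures quoted above can be turned into a rough budget. The sketch below is only an arithmetic illustration: it assumes one genotype per reaction, and the function name and its parameters are hypothetical rather than part of any TAQMAN software.

```python
import math


def taqman_budget(num_genotypes, ng_per_reaction=1.0,
                  usd_per_reaction=0.40, genotypes_per_run=384):
    """Estimate DNA, cost and run count for a set of genotypes.

    Assumption (illustrative only): one genotype consumes one reaction.
    Default values are the approximate figures quoted in the text.
    """
    if num_genotypes < 0:
        raise ValueError("num_genotypes must be non-negative")
    return {
        # 1000 ng = 1 microgram
        "dna_micrograms": num_genotypes * ng_per_reaction / 1000.0,
        "cost_usd": num_genotypes * usd_per_reaction,
        # A partially filled run still counts as a full run.
        "runs": math.ceil(num_genotypes / genotypes_per_run),
    }


if __name__ == "__main__":
    print(taqman_budget(3000))
```

For 3,000 genotypes this works out to 3 micrograms of DNA spread over 8 runs, which is the sense in which thousands of genotypes require microgram quantities.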
ILLUMINA has recently emerged as a leader in high-throughput genotyping. Unlike AFFYMETRIX, ILLUMINA genotyping arrays do not rely exclusively on hybridization. Instead, ILLUMINA technology uses an allele-specific DNA extension step, which is much more sensitive and specific than hybridization alone, for the original sequence detection. Then, all of these alleles are amplified in multiplex by PCR, and these products are then hybridized to bead arrays. The beads on these arrays contain unique “address” tags, not native sequence, so this hybridization is highly specific and sensitive. Alleles are then called by quantitative scanning of the bead arrays. The ILLUMINA GOLDEN GATE assay system genotypes up to 1536 loci concurrently, so the throughput is better than TAQMAN but not as high as AFFYMETRIX 500K arrays. The cost of ILLUMINA genotypes is lower than TAQMAN, but higher than AFFYMETRIX arrays. Also, the ILLUMINA platform takes as long to complete as the 500K AFFYMETRIX arrays (up to 72 hours), which is problematic for IVF genotyping. However, ILLUMINA has a much better call rate, and the assay is quantitative, so aneuploidies are detectable with this technology. ILLUMINA technology is much more flexible in choice of SNPs than 500K AFFYMETRIX arrays.
One of the highest throughput techniques, which allows for the measurement of up to 250,000 SNPs at a time, is the AFFYMETRIX GeneChip 500K genotyping array. This technique also uses PCR, followed by analysis by hybridization and detection of the amplified DNA sequences to DNA probes, chemically synthesized at different locations on a quartz surface. The disadvantages of these arrays are their low flexibility and lower sensitivity. There are modified approaches that can increase selectivity, such as the “perfect match” and “mismatch probe” approaches, but these do so at the cost of the number of SNP calls per array.
Pyrosequencing, or sequencing by synthesis, can also be used for genotyping and SNP analysis. The main advantages of pyrosequencing include an extremely fast turnaround and unambiguous SNP calls; however, the assay is not currently conducive to high-throughput parallel analysis. PCR followed by gel electrophoresis is an exceedingly simple technique that has met the most success in preimplantation diagnosis. In this technique, researchers use nested PCR to amplify short sequences of interest. Then, they run these DNA samples on a special gel to visualize the PCR products. Different bases have different molecular weights, so one can determine base content based on how fast the product runs in the gel. This technique is low-throughput and requires subjective analyses by scientists using current technologies, but has the advantage of speed (1-2 hours of PCR, 1 hour of gel electrophoresis). For this reason, it has been used previously for prenatal genotyping for a myriad of diseases, including: thalassaemia, neurofibromatosis type 2, leukocyte adhesion deficiency type I, Hallopeau-Siemens disease, sickle-cell anemia, retinoblastoma, Pelizaeus-Merzbacher disease, Duchenne muscular dystrophy, and Currarino syndrome.
Another promising technique that has been developed for genotyping small quantities of genetic material with very high fidelity is MOLECULAR INVERSION PROBES (MIPs), such as AFFYMETRIX's GENFLEX Arrays. This technique has the capability to measure multiple SNPs in parallel: more than 10,000 SNPs measured in parallel have been verified. For small quantities of genetic material, call rates for this technique have been established at roughly 95%, and accuracy of the calls made has been established to be above 99%. So far, the technique has been implemented for quantities of genomic data as small as 150 molecules for a given SNP. However, the technique has not been verified for genomic data from a single cell, or a single strand of DNA, as would be required for pre-implantation genetic diagnosis.
The MIP technique makes use of padlock probes, which are linear oligonucleotides whose two ends can be joined by ligation when they hybridize to immediately adjacent target sequences of DNA. After the probes have hybridized to the genomic DNA, a gap-fill enzyme is added to the assay which can add one of the four nucleotides to the gap. If the added nucleotide (A, C, T, G) is complementary to the SNP under measurement, then it will hybridize to the DNA, and join the ends of the padlock probe by ligation. The circular products, or closed padlock probes, are then differentiated from linear probes by exonucleolysis. The exonuclease, by breaking down the linear probes and leaving the circular probes, will change the relative concentrations of the closed vs. the unclosed probes by a factor of 1000 or more. The probes that remain are then opened at a cleavage site by another enzyme, removed from the DNA, and amplified by PCR. Each probe is tagged with a different tag sequence consisting of 20-base tags (16,000 have been generated), and can be detected, for example, by the AFFYMETRIX GENFLEX Tag Array. The presence of the tagged probe from a reaction in which a particular gap-fill nucleotide was added indicates the presence of the complementary nucleotide at the relevant SNP.
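The gap-fill logic described above can be sketched as a toy genotype caller. This is a minimal illustration, assuming four separate reactions (one per added nucleotide) and a single fixed signal threshold; the function name, the signal values and the threshold are hypothetical and do not describe any vendor's calling algorithm.

```python
# Watson-Crick complement of each nucleotide.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def call_template_alleles(signal_by_added_nucleotide, threshold):
    """Infer the allele(s) present on the template at one SNP.

    `signal_by_added_nucleotide` maps the nucleotide added in each
    gap-fill reaction to the measured tag signal for that reaction.
    A reaction only yields closed (circular) probes, and hence signal,
    if the added nucleotide is complementary to the template base.
    """
    alleles = []
    for added, signal in signal_by_added_nucleotide.items():
        if signal >= threshold:
            alleles.append(COMPLEMENT[added])
    return sorted(alleles)


if __name__ == "__main__":
    # Hypothetical signals: probes closed when A or G was added.
    signals = {"A": 950.0, "C": 12.0, "G": 880.0, "T": 9.0}
    print(call_template_alleles(signals, threshold=100.0))
```

In this toy model, two alleles above threshold suggest a heterozygous locus. A single allele above threshold is consistent with a homozygous locus, but it cannot by itself rule out dropout of the other allele.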
The molecular biological advantages of MIPs include: (1) multiplexed genotyping in a single reaction, (2) the genotype “call” occurs by gap fill and ligation, not hybridization, and (3) hybridization to an array of universal tags decreases false positives inherent to most array hybridizations. In traditional 500K, TAQMAN and other genotyping arrays, the entire genomic sample is hybridized to the array, which contains a variety of perfect match and mismatch probes, and an algorithm calls likely genotypes based on the intensities of the mismatch and perfect match probes. Hybridization, however, is inherently noisy, because of the complexities of the DNA sample and the huge number of probes on the arrays. MIPs, on the other hand, use multiplex probes (i.e., not on an array) that are longer and therefore more specific, and then use a robust ligation step to circularize the probe. Background is exceedingly low in this assay (due to specificity), though allele dropout may be high (due to poorly performing probes).
When this technique is used on genomic data from a single cell (or small numbers of cells) it will, like PCR-based approaches, suffer from integrity issues. For example, the inability of the padlock probe to hybridize to the genomic DNA will cause allele dropouts. This will be exacerbated in the context of in vitro fertilization, since the efficiency of the hybridization reaction is low, and it needs to proceed relatively quickly in order to genotype the embryo in a limited time period. Note that the hybridization reaction time can be reduced well below vendor-recommended levels, and micro-fluidic techniques may also be used to accelerate the hybridization reaction. These approaches to reducing the time for the hybridization reaction will result in reduced data quality.
Once the genetic data has been measured, the next step is to use the data for predictive purposes. Much research has been done in predictive genomics, which tries to understand the precise functions of proteins, RNA and DNA so that phenotypic predictions can be made based on genotype. Canonical techniques focus on the function of Single-Nucleotide Polymorphisms (SNPs), but more advanced methods are being brought to bear on multi-factorial phenotypic features. These methods include techniques, such as linear regression and nonlinear neural networks, which attempt to determine a mathematical relationship between a set of genetic and phenotypic predictors and a set of measured outcomes. There is also a set of regression analysis techniques, such as ridge regression, log regression and stepwise selection, that are designed to accommodate sparse data sets where there are many potential predictors relative to the number of outcomes, as is typical of genetic data, and which apply additional constraints on the regression parameters so that a meaningful set of parameters can be resolved even when the data is underdetermined. Other techniques apply principal component analysis to extract information from underdetermined data sets. Other techniques, such as decision trees and contingency tables, use strategies for subdividing subjects based on their independent variables in order to place subjects in categories or bins for which the phenotypic outcomes are similar. A recent technique, termed logical regression, describes a method to search for different logical interrelationships between categorical independent variables in order to model a variable that depends on interactions between multiple independent variables related to genetic data. Regardless of the method used, the quality of the prediction is naturally highly dependent on the quality of the genetic data used to make the prediction.
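As an illustration of how a constraint on the regression parameters makes an underdetermined problem solvable, the sketch below fits ridge regression in plain Python by solving (X^T X + lam * I) w = X^T y. It is a minimal example under stated assumptions: the function name, the tiny genotype matrix (minor-allele counts for three SNPs in two subjects) and the penalty value are hypothetical, and no intercept term is fitted.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y by Gaussian elimination.

    X is a list of rows (subjects) of predictor values, y the outcomes.
    With more predictors than subjects, X^T X alone is singular; a
    positive penalty `lam` makes the system solvable.
    """
    n, p = len(X), len(X[0])
    # Build the penalized normal equations.
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [float(sum(X[k][i] * y[k] for k in range(n))) for i in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system; use a larger penalty")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, p))
        w[i] = (b[i] - tail) / A[i][i]
    return w


if __name__ == "__main__":
    X = [[0, 1, 2], [1, 1, 0]]   # 2 subjects, 3 SNPs (hypothetical)
    y = [1.0, 2.0]               # hypothetical phenotype values
    print(ridge_fit(X, y, lam=0.1))
```

With the penalty set to zero, the same two-subject, three-SNP system has no unique solution and the function raises an error, which is the underdetermined case the text refers to.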
Normal humans have two sets of 23 chromosomes in every diploid cell, with one set coming from each parent. Aneuploidy, in which a cell has extra or missing chromosomes, and uniparental disomy, in which a cell has two copies of a given chromosome that both originate from one parent, are believed to be responsible for a large percentage of failed implantations, miscarriages, and genetic diseases. When only certain cells in an individual are aneuploid, the individual is said to exhibit mosaicism. Detection of chromosomal abnormalities can identify individuals or embryos with conditions such as Down syndrome, Klinefelter syndrome, and Turner syndrome, among others, in addition to increasing the chances of a successful pregnancy. Testing for chromosomal abnormalities is especially important as mothers age: between the ages of 35 and 40 it is estimated that between 40% and 50% of the embryos are abnormal, and above the age of 40, more than half of the embryos are abnormal.
Karyotyping, the traditional method used for the prediction of aneuploidies and mosaicism, is giving way to other higher-throughput, more cost-effective methods. One approach that has attracted much attention recently is flow cytometry (FC) combined with fluorescence in situ hybridization (FISH), which can be used to detect aneuploidy in any phase of the cell cycle. One advantage of this method is that it is less expensive than karyotyping, but the cost is significant enough that generally only a small selection of chromosomes is tested (usually chromosomes 13, 18, 21, X, Y; also sometimes 8, 9, 15, 16, 17, 22); in addition, FISH has a low level of specificity. Using FISH to analyze 15 cells, one can detect mosaicism of 19% with 95% confidence. The reliability of the test becomes much lower as the level of mosaicism gets lower, and as the number of cells to analyze decreases. The test is estimated to have a false negative rate as high as 15% when a single cell is analyzed. There is a great demand for a method that has a higher throughput, lower cost, and greater accuracy.
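The relationship between the number of cells analyzed and the detectable level of mosaicism can be illustrated with a simple sampling model. The sketch assumes, for illustration only, that each analyzed cell is independently abnormal with probability equal to the mosaic fraction and that every abnormal cell is read correctly; the source does not state how its 15-cell figure was derived, and the function names are hypothetical.

```python
import math


def detection_probability(mosaic_fraction, cells):
    """Chance that at least one of `cells` sampled cells is abnormal."""
    return 1.0 - (1.0 - mosaic_fraction) ** cells


def cells_needed(mosaic_fraction, confidence):
    """Smallest number of cells that reaches the requested confidence."""
    if not 0.0 < mosaic_fraction <= 1.0:
        raise ValueError("mosaic_fraction must lie in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie in (0, 1)")
    if mosaic_fraction == 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence)
                     / math.log(1.0 - mosaic_fraction))


if __name__ == "__main__":
    print(cells_needed(0.19, 0.95))
    print(round(detection_probability(0.19, 15), 4))
```

Under these assumptions, 15 cells is the smallest sample that detects 19% mosaicism with at least 95% probability, which agrees with the figure quoted above. The same model shows how quickly the detection probability falls as fewer cells are analyzed.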
Listed here is a set of prior art which is related to the field of the current invention. None of this prior art contains or in any way refers to the novel elements of the current invention. In U.S. Pat. No. 6,720,140, Hartley et al. describe a recombinational cloning method for moving or exchanging segments of DNA molecules using engineered recombination sites and recombination proteins. In U.S. Pat. No. 6,489,135 Parrott et al. provide methods for determining various biological characteristics of in vitro fertilized embryos, including overall embryo health, implantability, and increased likelihood of developing successfully to term by analyzing media specimens of in vitro fertilization cultures for levels of bioactive lipids in order to determine these characteristics. In US Patent Application 20040033596 Threadgill et al. describe a method for preparing homozygous cellular libraries useful for in vitro phenotyping and gene mapping involving site-specific mitotic recombination in a plurality of isolated parent cells. In U.S. Pat. No. 5,994,148 Stewart et al. describe a method of determining the probability of an in vitro fertilization (IVF) being successful by measuring Relaxin directly in the serum or indirectly by culturing granulosa lutein cells extracted from the patient as part of an IVF/ET procedure. In U.S. Pat. No. 5,635,366 Cooke et al. provide a method for predicting the outcome of IVF by determining the level of 11β-hydroxysteroid dehydrogenase (11β-HSD) in a biological sample from a female patient. In U.S. Pat. No. 7,058,616 Larder et al. describe a method for using a neural network to predict the resistance of a disease to a therapeutic agent. In U.S. Pat. No. 6,958,211 Vingerhoets et al. describe a method wherein the integrase genotype of a given HIV strain is simply compared to a known database of HIV integrase genotypes with associated phenotypes to find a matching genotype. In U.S. Pat. No. 7,058,517 Denton et al. describe a method wherein an individual's haplotypes are compared to a known database of haplotypes in the general population to predict clinical response to a treatment. In U.S. Pat. No. 7,035,739 Schadt et al. describe a method wherein a genetic marker map is constructed and the individual genes and traits are analyzed to give gene-trait locus data, which are then clustered as a way to identify genetically interacting pathways, which are validated using multivariate analysis. In U.S. Pat. No. 6,025,128 Veltri et al. describe a method involving the use of a neural network utilizing a collection of biomarkers as parameters to evaluate risk of prostate cancer recurrence.
The cost of DNA sequencing is dropping rapidly, and in the near future individual genomic sequencing for personal benefit will become more common. Knowledge of personal genetic data will allow for extensive phenotypic predictions to be made for the individual. In order to make accurate phenotypic predictions, high-quality genetic data is critical, whatever the context. In the case of prenatal or pre-implantation genetic diagnoses, a complicating factor is the relative paucity of genetic material available. Given the inherently noisy nature of the measured genetic data in cases where limited genetic material is used for genotyping, there is a great need for a method which can increase the fidelity of, or clean, the primary data.