Genomic DNA varies significantly from individual to individual, except in identical siblings. Many human diseases arise from genomic variations. The genetic diversity amongst humans and other life forms explains the heritable variations observed in disease susceptibility. Diseases arising from such genetic variations include Huntington's disease, cystic fibrosis, Duchenne muscular dystrophy, and certain forms of breast cancer. Each of these diseases is associated with a single gene mutation. Diseases such as multiple sclerosis, diabetes, Parkinson's disease, Alzheimer's disease, and hypertension are much more complex. These diseases may be due to polygenic (multiple gene influences) or multifactorial (multiple gene and environmental influences) causes. Many of the variations in the genome do not result in a disease trait. However, as described above, a single mutation can result in a disease trait. Further, many viral genomes are well known to be very diverse and highly mutable, and variation may confer traits on a virus that alter disease progression or treatment regimens. The ability to scan the human or viral genomes to identify the location of genes or specific mutations which underlie or are associated with the pathology of such diseases is an enormously powerful tool in medicine and human biology.
Several types of sequence variations, including insertions and deletions (indels), differences in the number of repeated sequences, and single base pair differences, result in genomic diversity. Single base pair differences, referred to as single nucleotide polymorphisms (SNPs), are the most frequent type of variation in the human genome (occurring at approximately 1 in 10^3 bases). A SNP is a genomic position at which two or more alternative nucleotide alleles occur at a relatively high frequency (greater than 1%) in a population. A SNP may also be a single base (or a few bases) insertion/deletion variant (referred to as “indels”). SNPs are well-suited for studying sequence variation because they are relatively stable (i.e., exhibit low mutation rates) and because single nucleotide variations (including insertions and deletions) can be responsible for inherited traits. It is understood that in the discussion above, the term SNP is also meant to be applicable to “indel” (defined below).
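The frequency criterion in the definition above can be illustrated with a short sketch. It is not part of the described methods; the function names, the sample data, and the treatment of the 1% figure as a strict threshold are assumptions made only for illustration.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return {allele: frequency} for the alleles observed at one position."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_snp(alleles, min_frequency=0.01):
    """True if at least two alleles each occur above min_frequency.

    min_frequency=0.01 mirrors the "greater than 1%" criterion in the text.
    """
    freqs = allele_frequencies(alleles)
    common = [a for a, f in freqs.items() if f > min_frequency]
    return len(common) >= 2


# Hypothetical observations: 1000 chromosomes sampled at each position.
site_a = ["A"] * 950 + ["G"] * 50   # minor allele at 5%
site_b = ["C"] * 995 + ["T"] * 5    # minor allele at 0.5%

print(is_snp(site_a))  # True: both alleles exceed 1%
print(is_snp(site_b))  # False: the minor allele is below 1%
```

Under this sketch, a position whose minor allele falls below the threshold is treated as a rare variant rather than a SNP.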
Polymorphisms identified using microsatellite-based analysis, for example, have been used for a variety of purposes. Use of genetic linkage strategies to identify the locations of single Mendelian factors has been successful in many cases (Benomar et al. (1995), Nat. Genet., 10:84-8; Blanton et al. (1991), Genomics, 11:857-69). Identification of chromosomal locations of tumor suppressor genes has generally been accomplished by studying loss of heterozygosity in human tumors (Cavenee et al. (1983), Nature, 305:779-784; Collins et al. (1996), Proc. Natl. Acad. Sci. USA, 93:14771-14775; Koufos et al. (1984), Nature, 309:170-172; and Legius et al. (1993), Nat. Genet., 3:122-126). Additionally, use of genetic markers to infer the chromosomal locations of genes contributing to complex traits, such as type I diabetes (Davis et al. (1994), Nature, 371:130-136; Todd et al. (1995), Proc. Natl. Acad. Sci. USA, 92:8560-8565) has become a focus of research in human genetics.
Although substantial progress has been made in identifying the genetic basis of many human diseases, current methodologies used to develop this information are limited by prohibitive costs and the extensive amount of work required to obtain genotype information from large sample populations. These limitations make identification of complex gene mutations contributing to disorders such as diabetes extremely difficult. Techniques for scanning the human genome to identify the locations of genes involved in disease processes began in the early 1980s with the use of restriction fragment length polymorphism (RFLP) analysis (Botstein et al. (1980), Am. J. Hum. Genet., 32:314-31; Nakamura et al. (1987), Science, 235:1616-22). RFLP analysis involves Southern blotting and other techniques. Southern blotting is both expensive and time-consuming when performed on large numbers of samples, such as those required to identify a complex genotype associated with a particular phenotype. Some of these problems were avoided with the development of polymerase chain reaction (PCR) based microsatellite marker analysis. Microsatellite markers are simple sequence length polymorphisms (SSLPs) consisting of di-, tri-, and tetra-nucleotide repeats.
Other types of genomic analysis are based on use of markers which hybridize with hypervariable regions of DNA having multiallelic variation and high heterozygosity. The variable regions which are useful for fingerprinting genomic DNA are tandem repeats of a short sequence referred to as a minisatellite. Polymorphism is due to allelic differences in the number of repeats, which can arise as a result of mitotic or meiotic unequal exchanges or by DNA slippage during replication.
Each of these current methods has significant drawbacks because it is time-consuming and limited in resolution. While DNA sequencing provides the highest resolution, it is also the most expensive method for determining variants. For example, the determination of SNP frequency among a population of 1000 different samples is very expensive using the methods described above, and the determination of SNP frequency among a population of 100,000 samples is prohibitive.
Infection with the Human Immunodeficiency Virus (generally referred to as HIV) is a disease where the sensitivity provided by DNA sequencing can have a great impact. This is particularly evident where DNA sequencing can provide highly sensitive detection of drug resistant HIV strains early in treatment when the frequency of each drug resistant strain is relatively low in the population.
HIV continues to be a major problem worldwide, even though a plethora of compounds have been approved for treatment. There are currently six classes of drugs, encompassing nucleoside/nucleotide reverse transcriptase inhibitors, non-nucleoside reverse transcriptase inhibitors, protease inhibitors, one fusion inhibitor, one chemokine receptor inhibitor, and one integrase inhibitor. These target four steps in the life cycle of HIV: the reverse transcription of the virion RNA to DNA, the integration of proviral DNA into the host genome, the cleavage of polypeptide viral gene products into functional proteins, and viral entry at the cell surface.
A major problem in the fight against HIV and AIDS is the development of drug resistance mutations. The International AIDS Society currently recognizes 73 amino acid residues where mutations have been associated with drug-resistance, 37 in the protease, 27 in the reverse transcriptase, 7 in the envelope, and 2 in the integrase (Johnson et al., Topics in HIV medicine (2007), 15:119, which is incorporated by reference herein in its entirety for all purposes). The list is updated at least once per year. Some of these mutations are specific to certain drugs, while others, either alone or in combination, lead to drug class resistance.
Due to the error-prone nature of viral reverse transcriptase and the high viral turnover (t½=1-3 days), the HIV genome mutates very rapidly. Reverse transcriptase is estimated to generate, on average, one mutation per replication of the 9.7 kb genome that does not dramatically affect the ability of the virus to propagate. This leads to the formation of ‘quasispecies’, where many different mutants exist in a dynamic relationship.
HIV patients are benefiting from antiviral treatment in terms of an increased life span, but many experience several episodes of so-called virological failure when they develop resistance to a certain drug or drug combination. Virological failure dramatically decreases the chance of survival (Hennessey et al., AIDS Res Hum Retroviruses (2000), 16:103, which is incorporated by reference herein in its entirety for all purposes).
It has been hypothesized that the presence of resistance-conferring mutations in a subset of the quasispecies found in a single patient before the initiation of treatment will lead to outgrowth of resistant virus and subsequent virological failure (Coffin, Science (1995), which is incorporated by reference herein in its entirety for all purposes). If this is correct, it would be of enormous benefit to detect mutations present at low frequencies before a treatment decision is made. Many laboratories have developed specialized techniques such as oligonucleotide arrays or real time PCR assays, but these are limited to determining the viral genotype at pre-selected nucleotide positions and need adjustment every time new resistance mutations are recognized. Direct sequencing is superior in that it is not biased towards certain nucleotide species/residues and also can identify new drug resistance-linked mutations. Currently, kits based on traditional Sanger sequencing are available to detect drug-resistance mutations, but unless the viral species are first sub-cloned, these systems are unable to reliably detect mutations with a frequency below 25%. Sequencing of sub-cloned species has clearly demonstrated that low-frequency resistance mutations do exist within a single patient-derived viral population and that many of these are missed by standard bulk sequencing (Palmer et al., J. Clin. Microbiol. (2005), 43:406, which is incorporated by reference herein in its entirety for all purposes). Additional work has also shown that approximately 10% of chronically HIV-infected, drug-naïve patients harbor resistance mutations (Novak et al., Clin. Infect. Dis. (2005), 40:468, which is incorporated by reference herein in its entirety for all purposes). It is quite reasonable to assume that these low-frequency, persistent resistance-linked mutations are present before treatment initiation and affect treatment outcome.
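The practical burden of sub-cloning can be illustrated with a simple calculation. The sketch below assumes each sub-clone is an independent draw from the viral population and ignores cloning bias and sequencing error. The function name and the 95% confidence level are illustrative choices and are not taken from the cited work.

```python
import math


def clones_needed(variant_frequency, confidence=0.95):
    """Smallest number of independently sampled clones n such that
    P(at least one clone carries the variant) >= confidence.

    Solves 1 - (1 - f)**n >= confidence for n.
    """
    if not 0 < variant_frequency < 1:
        raise ValueError("variant_frequency must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - variant_frequency))


# Illustrative frequencies: the 25% bulk-sequencing limit and a 1% variant.
for f in (0.25, 0.01):
    print(f, clones_needed(f))
```

Under these assumptions, roughly a dozen clones suffice for a variant at 25%, whereas a variant at 1% requires on the order of several hundred clones per patient sample, which suggests why sub-cloning does not scale to routine low-frequency detection.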
Therefore, efficient detection of these mutations through sequencing enables substantial advancement in knowledge of the disease and treatment possibilities from early detection. Further, embodiments of high throughput sequencing techniques enabled for what may be referred to as “Massively Parallel” processing have substantially more powerful analysis, sensitivity, and throughput characteristics than previous sequencing techniques. For example, the high throughput sequencing technologies employing HIV specific primers of the presently described invention are capable of detecting low abundance alleles present at a frequency of 1% or less of the allelic variants in a population. As described above, this is important in the context of detecting HIV variants, particularly for drug resistant variants where high sensitivity provides an important early detection mechanism that results in a substantial therapeutic benefit.
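The relationship between read depth and sensitivity to a 1% variant can be sketched with a binomial model. This is an illustration only and does not describe the sequencing chemistry or the variant-calling procedure of the invention. It assumes reads are independent samples of the viral population and ignores sequencing error. The function name, the read depths, and the minimum supporting-read count are hypothetical.

```python
from math import comb


def detection_probability(depth, variant_frequency, min_reads=1):
    """P(at least min_reads of `depth` reads carry the variant),
    assuming each read independently carries it with probability
    variant_frequency (binomial model, no sequencing error)."""
    p_fewer = sum(
        comb(depth, k)
        * variant_frequency ** k
        * (1 - variant_frequency) ** (depth - k)
        for k in range(min_reads)
    )
    return 1 - p_fewer


# Hypothetical depths; require 10 supporting reads for a 1% variant.
for depth in (100, 1000, 5000):
    print(depth, detection_probability(depth, 0.01, min_reads=10))
```

Requiring several supporting reads rather than one is a common way to guard against calling sequencing errors as variants. The threshold of 10 here is arbitrary and would need to be chosen against the actual error rate of the platform.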