Progress in the human genome project has seeded the need to (i) analyse the expression characteristics of genes and gene products and (ii) analyse the variations in genes and genomes. This has precipitated great interest in methods for large-scale, parallel studies. Interest in developing new methods for detecting variation has further been fuelled by the success of using DNA markers in finding genes for monogenic inherited disorders and recent proposals on large-scale association studies for dissecting complex traits. There is also a need for large-scale studies and high-throughput screening in the search for drugs in the pharmaceutical industry.
This interest in large scale studies may in the future also extend to other areas such as the semiconductor industry. There, the emergence of devices based on organic molecules such as poly(p-phenylene vinylene), PPV, and the nascent fields of molecular electronics and nanotechnology seed the demand for new molecules with novel or desirable features, and this in turn may create a need for large scale searching.
In the biotechnology and pharmaceutical sector, large scale studies are preferably done either in homogeneous assays on a microtitre plate (96 well and 384 well plates are common and higher capacity plates are available) or in an array format. Spatially addressable arrays (where the sequence identity of a molecule is specified by the location of the element in which the molecule is contained, within the array of elements) of chemical or biochemical species have found wide use in genetics, biology, chemistry and materials science. Arrays can be formed in (i) a disperse solid phase such as beads and bundled hollow fibres/optical fibres, (ii) individual wells of microtitre plates/nanovials or (iii) on a homogeneous medium/surface on which individual elements can be spatially addressed. The latter types of arrays (iii) can be made on semi-permeable materials such as gels, gel pads, porous silicon, microchannel arrays (so called 3-D biochips) (Benoit et al; Anal. Chem 2001 73:2412-2420) and impermeable supports such as silicon wafers, glass, gold coated surfaces, ceramics and plastics. They can also be made within the walls of microfluidic channels (Gao et al; Nucleic Acids Res. 2001 29: 4744-4750). Furthermore the surface or sub-surface may comprise a functional layer such as an electrode.
All elements in arrays of type (i) and (iii) are contained within a single reaction volume, whilst each element of (ii) is contained in a separate reaction volume.
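The notion of spatial addressing described above can be illustrated with a minimal Python sketch. The grid size, the dinucleotide probe set and the function names `build_array` and `identify` are hypothetical and chosen only for illustration; the point is that the identity of a probe is recovered purely from the (row, column) location of its element.

```python
from itertools import product


def build_array(probes, n_cols):
    """Lay probes out row by row; return a dict mapping (row, col) -> sequence."""
    return {divmod(i, n_cols): seq for i, seq in enumerate(probes)}


def identify(array, row, col):
    """Recover the probe sequence from the element's location alone."""
    return array[(row, col)]


# Hypothetical example: all 16 dinucleotide probes on a 4 x 4 grid.
probes = ["".join(p) for p in product("ACGT", repeat=2)]
array = build_array(probes, n_cols=4)

# The element in row 1, column 2 holds the seventh probe in the layout order.
print(identify(array, 1, 2))
```

In this sketch a signal detected at a given location is interpreted only through the lookup table, which is what distinguishes a spatially addressable array from a mixture of unlabelled probes.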
To date, methods have involved analysing the reactions of molecules in bulk. Although bulk or ensemble approaches have in the past proved useful, there are barriers to progress in a number of directions. The results generated are usually an average of millions of reactions, in which multiple events, multi-step events and variations from the average cannot be resolved. In addition, detection methods that are adapted for high frequency events are insensitive to rare events. The practical limitations associated with bulk analysis include the following:
1. The techniques used for the detection of events in bulk phase analysis are not sensitive enough to detect rare events which may be due to low sample amount or weak interaction with probes.
a. Detecting the presence of rare transcripts in mRNA profiling. This problem is related to the limited dynamic range of bulk analysis which is in the order of 10⁴ whereas the different abundance levels of mRNAs in a cell are in the 10⁵ range. Hence to cater for the more common events, detection methods are not sensitive enough to detect rare events.
b. In the amounts of samples that are usually available to perform genetic analysis there are not enough copies of each sequence in genomic DNA to be detected. Therefore the Polymerase Chain Reaction (PCR) is used to increase the amount of material from genomic DNA so that sufficient signal for detection can be obtained from the desired loci.
c. Due to secondary structure around certain target loci very few hybridisation events go to completion. The few that do, need to be detected. These events may be too few to be detected by conventional bulk measurements.
d. The number of analyte molecules in the sample is vanishingly small. For example, in pre-implantation analysis a single molecule must be analysed. In analysis of ancient DNA the amount of sample material available is often also very small.
2. A rare event in a background of common events at a particular locus is impossible to detect in the bulk phase due to it being masked by the more common events. There are a number of instances where this is important:
a. Detecting loss of heterozygosity (LOH) in tumours comprising mixed cell populations and early events in tumourigenesis.
b. Determining minimal residual disease in patients with cancer and early detection of relapse by detecting mutation within a wild type background.
c. Prenatal diagnosis of genetic disorders directly from the small number of foetal cells in the maternal circulation (hence detection from mother's blood rather than from amniocentesis).
d. Detection of specific alleles in pooled population samples.
3. It is difficult to resolve heterogeneous events. For example it is difficult to separate out the contribution (or the lack of) to signal from errors such as foldback, mis-priming or self-priming from genuine signals based on the interactions being measured.
4. Complex samples such as genomic DNA and mRNA populations pose difficulties.
a. One problem is cross reactions of analyte species within the sample.
b. On arrays, another is the high degree of erroneous interactions which in many cases are likely to be due to mismatch interactions driven by high effective concentrations of certain species. This is one reason for low signal to noise. A ratio as low as 1:12 has been used in published array studies for base calling (Cronin et al, Human Mutation 7:244-55, 1996).
c. In some cases erroneous interactions can even be responsible for the majority of signal (Mir, K; D. Phil thesis, Oxford University, 1995).
d. Detecting a true representative signal of a rare mRNA transcript within a mRNA population is difficult.
e. PCR is used in genetic analysis to reduce the complexity of sample from genomic DNA, so that the desired loci become enriched.
5. The bulk nature of conventional methods does not allow access to specific characteristics (particularly, more than one feature) of individual molecules. One example in genetic analysis is the need to obtain genetic phase or haplotype information: the specific alleles associated with each chromosome. Bulk analysis cannot resolve haplotype from a heterozygotic sample. Current molecular biology techniques that are available, such as allele-specific or single molecule PCR are difficult to optimise and apply on a large scale.
6. Transient processes are difficult to resolve. This is needed when deciphering the molecular mechanisms of processes. Also transient molecular binding events (such as nucleation of a hybridisation event which is blocked from propagation due to secondary structure in the target) have fractional occupancy times which cannot be detected by conventional solid-phase binding assays.
When two samples are compared, small differences in concentration (less than twofold difference) are difficult to unequivocally discern.
Microarray gene expression analysis using unamplified cDNA target typically requires 10⁶ cells or 100 micrograms of tissue. Neither expression analysis nor analysis of genetic variation can be performed directly on material obtained from a single cell, which would be advantageous in a number of cases (e.g. analysis of mRNA from cells in early development or genomic DNA from sperm).
Further, it would be highly desirable if the amplification processes that are required before most biological or genetic analysis could be avoided.
PCR is used for the analysis of Variable Number of Tandem Repeats, which is central to forensics and paternity testing. Linkage studies have traditionally used Short Tandem Repeats as markers, the analysis of which is performed by PCR.
The need to avoid PCR is particularly acute in the large scale analysis of SNPs. The need to design primers and perform PCR on a large number of SNP sites presents a major drawback. The largest scales of analysis that are currently being implemented (e.g. using Orchid Bioscience and Sequenom systems) remain too expensive to allow meaningful association studies to be performed by all but a few large organizations such as the pharmaceutical companies. Although the number of SNPs needed for association studies has been actively debated, the highest estimates are being revised down due to recent reports that there are large blocks of linkage disequilibrium within the genome. Hence, the number of SNPs needed to represent the diversity in the genome could be 10-fold fewer than was expected. However, this needs to be taken with the caveat that there are some regions of the genome where the extent of linkage disequilibrium is far lower and a greater number of SNPs would be needed to represent the diversity in these areas. Even so, if each site had to be amplified individually the task would be enormous. In practice, PCR can be multiplexed. However, the extent to which this can be done is limited: increased errors, such as primer-dimer formation and mismatches, as well as the increased viscosity of the reaction, present barriers to success and limit multiplexing to around ten sites in most laboratories.
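The scale of the amplification burden can be made concrete with a small arithmetic sketch in Python. The panel size of 100,000 SNP sites is a hypothetical figure chosen only for illustration; the multiplexing level of ten is taken from the approximate limit stated above, and the function name `reactions_needed` is likewise an assumption.

```python
import math


def reactions_needed(n_sites, multiplex_level):
    """Number of PCR reactions needed per sample to cover n_sites loci."""
    if multiplex_level < 1:
        raise ValueError("multiplex_level must be at least 1")
    return math.ceil(n_sites / multiplex_level)


# Hypothetical panel size; not a figure from the text.
n_sites = 100_000

singleplex = reactions_needed(n_sites, 1)   # one site per reaction
tenplex = reactions_needed(n_sites, 10)     # about ten sites per reaction

print(singleplex, tenplex)
```

Under these assumptions, tenfold multiplexing still leaves thousands of separate reactions per sample, and that number is multiplied again by the number of individuals in a study.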
It is clear that the cost of performing SNP detection reactions on the scale required for high-throughput analysis of polymorphisms in a population is prohibitive if each reaction needs to be conducted separately, or if only a limited multiplexing possibility exists. A highly multiplexed, simple and cost-effective route to SNP analysis will be required if the potential of pharmacogenomics, pharmacogenetics and large-scale genetics is to be realised. DNA pooling is a solution for some aspects of genetic analysis, but accurate allele frequencies must be obtained, which is difficult, especially for rare alleles.
Since it involves determining the association of a series of alleles along a single chromosome, the haplotype is thought to be far more informative than the analysis of individual SNPs. An international effort is underway to make a comprehensive haplotype map of the human genome. Generally, haplotypes are determined by long-range allele-specific PCR. However, the construction of somatic cell hybrids prior to haplotype determination is an alternative method.
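Why an unphased (bulk) genotype does not determine the haplotype can be shown with a minimal Python sketch. The genotypes and the function name `compatible_haplotype_pairs` are hypothetical; the sketch simply enumerates every pair of haplotypes that would give the same per-site genotype, on the assumption that each site is reported as an unordered pair of alleles.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of 2-tuples of alleles, one tuple per site,
              e.g. [("A", "G"), ("C", "C")].
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # choice[i] selects which allele at site i goes on the first chromosome.
        hap1 = "".join(site[c] for site, c in zip(genotype, choice))
        hap2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Hypothetical genotype: two heterozygous sites and one homozygous site.
two_het = compatible_haplotype_pairs([("A", "G"), ("C", "T"), ("G", "G")])

# Hypothetical genotype: three heterozygous sites.
three_het = compatible_haplotype_pairs([("A", "G"), ("C", "T"), ("G", "T")])

print(two_het)
print(len(three_het))
```

With n heterozygous sites there are 2^(n-1) compatible haplotype pairs, so the ambiguity doubles with each additional heterozygous site unless the alleles are read from individual chromosomes.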
A method for haplotyping on single molecules in solution has been proposed in patent WO 01/90418. However, in this method the molecules are not surface captured, positional information of the SNP is not obtained and each SNP must be coded with a different colour.
For several years, plans for large scale SNP analysis have been laid around the common disease-common variant (CD/CV) (i.e. common SNP) hypothesis of complex diseases (Reich D E and Lander E S, Trends Genet 17: 502-50, 2001). The SNP consortium has amassed more than a million putatively common SNPs. However, practical use of this set is confounded by the fact that different SNPs may be common in different ethnic populations and many of the putative SNPs may not be truly polymorphic. Furthermore, the CD/CV hypothesis has recently come under challenge from assertions that rare alleles may contribute to the common diseases (Weiss K M, Clark A G, Trends Genet 2002 January; 18(1):19-24). If this were the case, a “new” rare allele would be sufficiently in linkage disequilibrium with a common SNP for the association with the region that contains both to be successfully made. If, however, the allele was “ancient” and rare, then the common SNPs and haplotype maps would not represent the diversity. In this scenario alternative strategies are needed to find causative regions. Instead of a genome-wide scan of common SNPs, there may be a need for whole genome sequencing or re-sequencing of thousands of case and control samples to access all variants. The commercial sequencing of the human genome, which built on information from the public genome project, cost approximately 300 million dollars over a period of about one year. This cost and timescale are prohibitive as an alternative to SNP analysis for finding associations between DNA sequence and disease. Clearly, if sequencing is to replace current approaches to large scale genetic studies, radically different methods are needed.
It would be advantageous if sequencing runs could be on the scale of genomes or at least small genomes or whole genes. Even increasing read-lengths beyond 300-500 nt would be useful. Today, sequencing is almost exclusively done by the Sanger dideoxy method. A number of alternative sequencing methods have been suggested but none are in use today. These methods include:
1 Sequencing by synthesis
2 Direct analysis of the sequence of a single molecule
3 Sequencing by Hybridisation
Re-sequencing by chip methods is an alternative to de novo sequencing. The 21.7 million bases of non-repetitive sequence of chromosome 21 have recently been re-sequenced by chip methods by Patil et al (Science 294: 1719-1722, 2001). The haplotype structure was conserved in this study by making somatic cell hybrids prior to chip analysis. However, the cost of large scale re-sequencing by this method is still high, and only 65% of the bases that were probed gave results of enough confidence for the base to be called.