Genomic copy number information is commonly obtained using whole genome amplification (WGA). The endemic problem with the WGA method is over-sampling of certain regions, yielding non-uniform amplification of the genome (1). WGA methods begin with an initiating step in which a polymerase (Phi 29) synthesizes a strand from genomic DNA using a random primer coupled to an adaptor for subsequent PCR (FIG. 1). If the input DNA strands are referred to as the “0-th derivative” and the first synthesized strand as the “first derivative,” then a subsequent strand is called the (n+1)-th derivative if its template was an n-th derivative. Only strands that are second derivative or higher are amplified in the PCR step, resulting in a ‘stacking’ over the regions ‘chosen’ by the polymerase for the first derivative.
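The derivative terminology can be illustrated with a minimal sketch. The `Strand` class and the function names below are hypothetical and introduced only for illustration; the sketch encodes nothing beyond the definition given here, namely that a strand's order is one more than that of its template and that only strands of order two or higher are amplified in the PCR step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strand:
    """Hypothetical record of a strand and the strand it was copied from."""
    name: str
    template: Optional["Strand"] = None  # None marks an input (0-th derivative) strand


def derivative_order(strand: Strand) -> int:
    """Count how many copying steps separate a strand from the input DNA."""
    order = 0
    while strand.template is not None:
        order += 1
        strand = strand.template
    return order


def amplified_in_pcr(strand: Strand) -> bool:
    """Per the definition above, only second derivative or higher strands amplify."""
    return derivative_order(strand) >= 2


# A short lineage: input DNA -> first derivative -> second -> third.
genomic = Strand("genomic")
first = Strand("first", genomic)
second = Strand("second", first)
third = Strand("third", second)

orders = [derivative_order(s) for s in (genomic, first, second, third)]
amplified = [s.name for s in (genomic, first, second, third) if amplified_in_pcr(s)]
```

In this toy lineage every amplifiable strand descends from the single first derivative, which is the sense in which the regions selected at the first step come to dominate the final product.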
Coverage of the genome by sequencing WGA products of single cell DNA is limited by the stacking phenomenon (FIG. 2). It is therefore difficult to obtain single cell measurements, particularly when based on WGA, because of distortions that originate from stochastic sampling and amplification steps. Moreover, the current method of WGA is a black box that relies on unspecified reagents purchased from a vendor, which hampers optimization. In addition, the WGA method does not extend to single cell RNA profiling.
Ligation-mediated PCR was developed in an attempt to solve the above-identified problems inherent in WGA. In this method, adaptors are ligated to an MseI restriction endonuclease digest of genomic DNA from a single cell, followed by PCR amplification using primers complementary to the adaptors. The amplified DNA is then used for CGH or DNA sequencing (2,3). However, like WGA, the method still requires an amplification step.
Parameswaran et al. (2007) and U.S. Pat. No. 7,622,281 describe methods of labeling nucleic acid molecules with barcodes for the purpose of identifying the source of the nucleic acid molecules, thereby allowing for high-throughput sequencing of multiple samples (4,5). Eid et al. (2009) describe a single-molecule sequencing method wherein single-molecule real-time sequencing data are obtained from a DNA polymerase performing uninterrupted template-directed synthesis using four distinguishable fluorescently labeled dNTPs (6). However, these methods do not provide genomic information unaffected by amplification distortion.
Miner et al. (2004) describe a method of molecular barcoding to label template DNA prior to PCR amplification, and report that the method allows for the identification of contaminant and redundant sequences by counting only distinctly tagged sequences (22). U.S. Pat. No. 7,537,897 describes methods for molecular counting by labeling molecules of an input sample with unique oligonucleotide tags and subsequently amplifying and counting the number of different tags (23). Miner et al. and U.S. Pat. No. 7,537,897 both describe labeling of input nucleic acid molecules by ligation, which has been found to be an inefficient reaction.
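The counting principle common to these tag-based approaches can be sketched as follows. This is a simplified illustration and not the procedure of any cited reference: the function name `count_molecules`, the locus labels, and the tag sequences are all hypothetical, and the sketch assumes that tags are read without error and that no two input molecules at the same locus received the same tag.

```python
from collections import Counter, defaultdict


def count_molecules(reads):
    """Count distinct tags per locus from an iterable of (locus, tag) pairs.

    Reads that share both locus and tag are treated as amplification copies
    of one input molecule, so they contribute a single count.
    """
    tags_by_locus = defaultdict(set)
    for locus, tag in reads:
        tags_by_locus[locus].add(tag)
    return {locus: len(tags) for locus, tags in tags_by_locus.items()}


# Hypothetical reads: one locusA molecule (tag ACGT) was over-amplified.
reads = [
    ("locusA", "ACGT"),
    ("locusA", "ACGT"),
    ("locusA", "ACGT"),
    ("locusA", "TTGA"),
    ("locusB", "GGCC"),
]

raw_read_counts = dict(Counter(locus for locus, _ in reads))
molecule_counts = count_molecules(reads)
```

Here the raw read counts differ fourfold between the two loci, whereas the distinct-tag counts differ only twofold, because the repeated copies of one tagged molecule are collapsed into a single count.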
McCloskey et al. (2007) describe a method of molecular encoding which does not use ligation but instead uses template-specific primers to barcode template DNA molecules prior to PCR amplification (24). However, such a method requires that template-specific primers be made for each species of template DNA molecule studied.
As described herein, obtaining accurate genomic copy number information by high-throughput sequencing of genomic DNA prepared by WGA methods is hampered by the copy number distortions introduced by non-uniform amplification of genomic DNA. Thus, there exists a need for a method that allows for copy number determination free of distortions caused by amplification steps and which allows for accurate and efficient copy number determination of complex samples. Such a method should also be robust using existing methodologies for high volume, massively parallel sequencing.