The present invention relates to methods for the identification and isolation of nucleic acid fragments. More specifically, the invention covers methods for the identification and isolation of non-redundant mRNAs and novel genomic sequences.
The elucidation of the mechanisms that dictate the normal functioning of living cells requires a detailed understanding of the information encoded in the complete genome. Messenger RNA (mRNA) sequences are typically used to map and sequence the genes contained in the genomes of different organisms. The sequence information is used to evaluate the genetic makeup of a particular cell or organism of interest. However, mRNAs are produced at different levels within different cell types and during different points in development. The distribution of mRNA types, their developmental and cell-type specific regulated expression, and their translation into protein produce the unique character of a particular cell type.
There are currently several world-wide research efforts aimed at cloning, mapping, and sequencing the genomes of various organisms, including Homo sapiens. Information from these projects will assist in providing an understanding of how genomes result in the organisms they specify. Furthermore, an understanding of the molecular makeup of normal functioning cells is essential to the understanding of various cellular processes, and to the diagnosis and treatment of diseases in which the regulation and expression of one or more genes has changed.
Integral to this goal is the production of libraries of cloned nucleic acids. Two different types of DNA libraries are typically used in the art. The first type, genomic libraries, are constructed by placing randomly cleaved DNA fragments of an entire genome into a suitable cloning vector. Although some of the clones in a genomic library contain genes or portions of genes, most clones contain non-coding DNA.
The second type of DNA library is the complementary DNA (cDNA) library. These libraries are constructed from DNA that is reverse transcribed from mRNA isolated from a source of interest; cDNA libraries primarily contain DNA that codes for genes. However, different species of mRNA are not equally represented in a given cell. These mRNA molecules are distributed into three frequency classes: (1) superprevalent (consisting of approximately 10-15 mRNAs which, together, represent 10-20% of the total mRNA mass); (2) intermediate (consisting of approximately 1,000-2,000 mRNAs which, together, represent 40-45% of the total mRNA mass); and (3) complex (consisting of approximately 15,000-20,000 mRNAs which, together, represent 40-45% of the total mRNA mass). Davidson and Britten, SCIENCE 204: 1052-1059 (1979). Differential levels of mRNAs within a cell are a significant obstacle to the identification and sequencing of low-abundance mRNA species. In the creation of nucleic acid libraries suitable for sequencing, superprevalent mRNAs impede the isolation and analysis of lower abundance mRNAs. Since the majority of clones isolated from a cDNA library will be from superprevalent and intermediate prevalent mRNAs, significant time and effort are spent resequencing previously known prevalent mRNA species, and large numbers of clones must be sequenced in order to isolate and sequence low-abundance mRNA species. Thus, the rate of gene discovery from libraries is limited by the redundant nature of mRNAs present in a given cell. The presence of highly abundant mRNAs also hinders the comparison of differences in active genes observed in different cells of related tissue types, cells in varying stages of development, the effect of stimuli, and differential gene expression between normally functioning and abnormal cells (e.g., cells from normal tissue compared to tumor tissues).
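The scale of this redundancy can be illustrated with a rough calculation. The sketch below is not part of the described method; it assumes that clone frequency in a non-normalized library mirrors mRNA mass fraction, that species within a class are equally abundant, and it uses midpoints of the quoted ranges (taking the intermediate class as 1,000 to 2,000 species and the complex class as 15,000 to 20,000 species). The names `CLASSES`, `per_species_frequency`, and `clones_needed` are hypothetical.

```python
import math

# Midpoints of the frequency-class ranges quoted above (an assumption made
# for illustration; the source gives ranges, not point values).
CLASSES = {
    "superprevalent": {"species": 12, "mass_fraction": 0.15},
    "intermediate": {"species": 1500, "mass_fraction": 0.425},
    "complex": {"species": 17500, "mass_fraction": 0.425},
}


def per_species_frequency(cls):
    """Fraction of random clones expected to come from one species of a class,
    assuming all species in the class are equally abundant."""
    return cls["mass_fraction"] / cls["species"]


def clones_needed(p, confidence=0.95):
    """Smallest number of random clones n such that a species with per-clone
    frequency p is picked at least once with the given probability,
    i.e. 1 - (1 - p)**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log1p(-p))


if __name__ == "__main__":
    for name, cls in CLASSES.items():
        p = per_species_frequency(cls)
        print(name, p, clones_needed(p))
```

Under these assumptions, a given complex-class mRNA requires several hundred times more randomly picked clones than a given superprevalent mRNA before it is likely to be seen even once.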
One method for reducing the variation in the abundance of the individual nucleic acid molecules in a library is to produce a normalized library. Two approaches for generation of normalized libraries have been proposed. Weissman, MOL. BIOL. MED. 4: 133-143 (1987). These techniques include (1) hybridization to genomic DNA, in which the frequency of each hybridized cDNA in the normalized library is proportional to that of each corresponding gene in the genomic DNA, and (2) a kinetic approach that relies on the difference in annealing kinetics between abundant and rare species (Galau et al., ARCH. BIOCHEM. BIOPHYS. 179: 584-599 (1977)). Several investigators employ the kinetic approach. For example, Soares et al. use single-stranded circles in their approach (see, e.g., U.S. Pat. Nos. 5,846,721 and 5,830,662), while Li et al. use haptenylated nucleic acid molecules (PCT application WO 99/15702). An alternative approach uses reassociation of short double-stranded cDNAs. Ko, NUCLEIC ACIDS RES. 18: 5705-5711 (1990).
Although normalization increases the chance of sequencing low-abundance nucleic acids, at best the relative concentrations of the mRNA species in a normalized library are within one to two orders of magnitude of one another. Accordingly, the super- and intermediate-abundance nucleic acids remain well represented in the library. Any attempt to randomly select and sequence clones from a normalized library will result in the selection of a high percentage of previously-characterized high abundance nucleic acid species.
Therefore, a need remains in the art for a method of rapidly and efficiently identifying and discarding previously-identified clones, thereby eliminating the redundancy in a population of nucleic acid molecules. Such a method would avoid the need to continuously re-sequence previously-characterized nucleic acid fragments and would permit the rapid and efficient identification and sequencing of novel genes.
The present invention relates to a highly efficient, high-throughput method for the identification and elimination of redundancy in a population of nucleic acid molecules using microarrays. The method comprises providing a random sample of nucleic acid fragments; immobilizing the random sample of nucleic acid fragments on a microarray; hybridizing to the microarray one or more labeled probes corresponding to previously arrayed or sequenced fragments; detecting fragments hybridized to the labeled probes and identifying at least one fragment not hybridized or weakly hybridized to the labeled probes; and sequencing an identified fragment that was not hybridized or was weakly hybridized to the labeled probes. The nucleic acid fragments may be RNA or DNA, and may be cloned into a vector or not. In some embodiments, the nucleic acid fragments are members of a cDNA or genomic library, which may be normalized or non-normalized. In other embodiments, the nucleic acid fragments are PCR fragments. In many embodiments, the nucleic acid fragments are amplified, e.g., by PCR.
The nucleic acid fragments are then immobilized to a solid surface, in a microarray. The solid surface is preferably glass. Labeled probes that correspond to previously arrayed or sequenced fragments (i.e., the subtraction probe) are hybridized to the immobilized nucleic acid fragments. The probes may carry fluorescent, luminescent, or radioactive labels, biotin, haptens, or other chemical tags that allow for easy detection of labeled probes. Generally, the unhybridized probes are removed. Nucleic acid fragments that are not hybridized or are weakly hybridized to a labeled probe are isolated and are then pooled with the previous set of probes to generate a new, larger probe set. Usually, the newly isolated fragments are sequenced and their sequences compared to those found in a sequence database.
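The selection of unhybridized or weakly hybridized fragments can be sketched as a simple thresholding step on measured spot signals. This is only an illustration under stated assumptions: the text does not specify how "weakly hybridized" is quantified, so the cutoff `weak_threshold`, the spot identifiers, the signal values, and the function name `select_candidates` are all hypothetical.

```python
def select_candidates(signals, weak_threshold):
    """Return the IDs of array spots whose hybridization signal is at or
    below weak_threshold, ordered from weakest signal upward.

    signals: mapping of spot ID -> measured label intensity (arbitrary units).
    weak_threshold: hypothetical cutoff separating weak from strong signal.
    """
    weak = [(value, spot) for spot, value in signals.items()
            if value <= weak_threshold]
    return [spot for value, spot in sorted(weak)]


if __name__ == "__main__":
    # Made-up intensities for four spots after hybridization and washing.
    example = {"A1": 5200.0, "A2": 140.0, "A3": 90.0, "A4": 3100.0}
    # Spots A2 and A3 would be picked for sequencing and added to the probe.
    print(select_candidates(example, weak_threshold=200.0))
```

In practice the cutoff would presumably be set relative to background and control spots on the same array; a fixed number is used here only to keep the example short.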
The methods of the present invention involve a subtraction protocol that identifies and isolates non-redundant nucleic acid fragments from a population of nucleic acid molecules. In most embodiments, the protocol is reiterated, in order to create a set of fragments that becomes more biased toward previously uncharacterized genes with each successive round. Accordingly, with each round of subtraction, probes corresponding to the newly isolated fragments are labeled and added to the previous subtraction probe, and this new subtraction probe is hybridized to the next microarray containing randomly picked nucleic acid fragments. This procedure is repeated several times, always adding the newly identified sequences to the previous subtraction probe. Thus, the method allows the identification and isolation of non-redundant or minimally overlapping nucleic acid fragments from sources of interest and enhances the rate of novel gene discovery. In a preferred embodiment, the non-redundant clones that are isolated using the methods of the invention are identified, selected, and immobilized to a new microarray to produce a unigene set.
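The reiterated protocol can be sketched as a small simulation. This is an idealized model, not the method itself: it assumes hybridization discriminates perfectly between species already in the subtraction probe and those that are not, it starts from an empty probe, and it represents each clone simply by the name of its mRNA species. The function name `iterative_subtraction` and its parameters are hypothetical.

```python
import random


def iterative_subtraction(library, clones_per_array, rounds, seed=0):
    """Simulate repeated rounds of array subtraction.

    library: list of species labels, one entry per clone, so that abundant
             species appear many times.
    Returns (subtraction_probe, clones_sequenced, history), where history
    holds one (unhybridized_count, probe_size) tuple per round.
    """
    rng = random.Random(seed)
    subtraction_probe = set()  # species already arrayed or sequenced
    clones_sequenced = 0
    history = []
    for _ in range(rounds):
        # Randomly pick clones and spot them on a new microarray.
        array = [rng.choice(library) for _ in range(clones_per_array)]
        # Idealized hybridization: a spot gives signal only if its species
        # is already represented in the subtraction probe.
        unhybridized = [c for c in array if c not in subtraction_probe]
        # Only unhybridized spots are sequenced, then added to the probe.
        clones_sequenced += len(unhybridized)
        subtraction_probe.update(unhybridized)
        history.append((len(unhybridized), len(subtraction_probe)))
    return subtraction_probe, clones_sequenced, history


if __name__ == "__main__":
    # Toy library: one abundant species and ten rare ones.
    lib = ["abundant"] * 90 + [f"rare{i}" for i in range(10)]
    probe, sequenced, hist = iterative_subtraction(lib, 20, 5)
    print(len(probe), sequenced, hist)
```

Because the abundant species enters the probe as soon as it is first arrayed, later rounds spend sequencing effort only on spots that were not yet represented, which is the bias toward uncharacterized fragments described above.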
Numerous applications can be envisioned for this invention. Specifically, any application in which the practitioner desires to enrich for sequences of interest or remove undesired nucleic acid fragments is amenable to the methods of the invention. A non-limiting set of uses includes:
I. A microarray-based method for enhancing the rate of discovery of expressed mRNA/cDNA sequences and facilitating construction of a "Unique Gene" set. This method allows for increased novel gene discovery of expressed cDNAs and expedited construction of a Unique Gene set of expressed cDNAs.
II. A microarray-based method for enhancing the rate of discovery of genomic sequences and facilitating isolation of DNA fragments corresponding to a whole genome or subregions of interest. In this application, the method provides for increased discovery of genomic clones, expedited construction of a set of non-redundant or minimally tiled genomic clones, increased discovery of clones mapping to a region of interest, expedited construction of a set of genomic clones in a region of interest in the genome, and expedited filling of gaps in genomic maps (to facilitate disease gene mapping and disease gene identification).
III. A microarray-based method for enrichment and/or isolation of DNA sequences (mRNA/cDNA, genomic, extrachromosomal, plasmid and all other) that are unique to one population compared to another population. The invention also allows for identification of sequences (expressed cDNA or genomic) unique or novel to one organism versus another, including nucleic acid molecules from different strains (e.g., pathogenic vs. non-pathogenic) and different species.
IV. A microarray-based method for increasing discovery of related (or conserved) DNA sequences (mRNA/cDNA, genomic, extrachromosomal, plasmid and all other). Conversely, the method allows for identification of related sequences among closely or distantly related organisms.
V. A microarray-based method for enhancing the rate of removal of undesired sequences. In another embodiment, the invention provides for removal of undesired DNA sequences, including contaminating DNA sequences and sequences closely related to those previously identified.
VI. A microarray-based method for identifying changes in copy number (under or over represented) of DNA sequences (genomic, extrachromosomal, plasmid and all other) between different sources of nucleic acids.