The quantification of RNA expression provides major insights into the analysis of cellular metabolism, function, growth and interactions. Although individual RNA species have historically been the subject of these studies, more interest is currently being shown in analysis of the patterns of the simultaneous expression of multiple RNA species of both known and unknown function. This approach allows comparative studies on the patterns of expression between different populations of cells, thereby serving as an indicator of the differences in biochemical activities taking place within these populations. For instance, a single group of cells can be divided up into two or more populations where one population serves as a control and the others are exposed to drugs, metabolites or different physical conditions. In this way, although the majority of the various species of mRNA show little or no difference in expression levels, certain mRNA species may show dramatically increased or decreased levels of expression compared to the untreated or normal control.
As an example, it has long been known that the application of a phorbol ester (PMA) results in changes in a large number of characteristics of mammalian cells growing in vitro. In an experiment reported by Lockhart et al. (1996, Nature Biotechnology 14; 1675-1680), cells growing in culture were exposed to PMA and, at various times afterwards, mRNA was extracted and used to create a library of labeled probes. This material was subsequently hybridized to an array of nucleic acids that was complementary to various mRNA sequences. Significant changes could be seen in both the timing and the amount of induction of various cellular cytokines. On the other hand, so-called “house-keeping” genes such as actin and GAPDH remained essentially unaffected by the treatment. This example demonstrates that the various mRNAs can be independently monitored to determine which particular genes may be affected by a treatment.
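The comparison described above, treated versus control signal with house-keeping genes as an unchanged reference, can be sketched as a small calculation. This is a minimal illustration only: the function name, the gene labels `cytokine_A` and `cytokine_B`, the intensity values and the two-fold cutoff are all hypothetical and are not taken from the cited study.

```python
import math


def log2_fold_changes(control, treated, housekeeping):
    """Return log2(treated/control) per gene after scaling to house-keeping genes.

    `control` and `treated` map gene names to signal intensities.
    `housekeeping` lists genes assumed to be unaffected by the treatment.
    """
    # Scale the treated sample so that total house-keeping signal matches the control.
    hk_control = sum(control[g] for g in housekeeping)
    hk_treated = sum(treated[g] for g in housekeeping)
    scale = hk_control / hk_treated
    return {g: math.log2(treated[g] * scale / control[g]) for g in control}


# Hypothetical signal intensities in arbitrary units (assumed for illustration).
control = {"actin": 1000.0, "GAPDH": 800.0, "cytokine_A": 50.0, "cytokine_B": 40.0}
treated = {"actin": 2000.0, "GAPDH": 1600.0, "cytokine_A": 800.0, "cytokine_B": 80.0}

changes = log2_fold_changes(control, treated, ["actin", "GAPDH"])

# Genes whose normalized signal at least doubled (assumed cutoff: log2 ratio >= 1).
induced = sorted(g for g, lfc in changes.items() if lfc >= 1.0)
```

In this sketch the treated sample carries twice the overall signal, so `cytokine_B` only appears to double; after scaling to the house-keeping genes, only `cytokine_A` is reported as induced.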
Natural differences between cell populations can also be examined. For instance, differences in the expression levels of various genes can be observed when cells progress through cell cycles (Cho et al., 1998 Mol Cell 2; 65-73 and Spellman et al., 1998 Mol. Biol. Cell 95; 14863-14868). The gene expression profiles that were generated by these studies validated this approach when significant differences in expression were observed for genes that had previously been characterized as encoding cell cycle related proteins. In addition, the arrays used in these studies comprised nucleic acid sequences that represented the entire genetic complement of the yeast being studied. As such, one of the results of these studies was the observation of a number of genes of previously unknown function that also displayed cell cycle dependent expression. Re-examination of these particular genes by other more conventional methods demonstrated that they were involved in cell cycle progression. Thus, this method was demonstrated as being capable of recognizing genes previously known for differential expression and also for identifying new genes.
The differences between normal and transformed cells have also been a subject of long-standing interest. The nature of the particular genes that are either overexpressed or underexpressed relative to normal cells may provide information on the origination, progression or treatment of cancerous cells. Array analysis has been carried out by using RNA from tumor derived cells in comparison with expression from normal cells. In one study by Perou et al. (1999 Proc. Nat Acad. Sci. USA 96; 9212-9217), human mammary epithelial cells (HMEC) were compared with specimens from primary breast tumors. Included in this study were responses to various cell factors as well as the results of confluence or senescence in the control cultures. All of these are factors that may be involved in or affected by cellular transformation into the cancerous state. The amount of data generated in this type of study is almost overwhelming in its complexity. However, distinct patterns or clusters of expression can be observed that are correlated to factors associated with the specimens. Further understanding will also be gained when data are gathered from expression in other tumor types and their untransformed equivalents.
There are two distinct elements in all of the expression studies that employ arrays. The first element is concerned with the preparation of the bank of probes that will be used to bind or capture labeled material that is derived from the mRNAs that are being analyzed. The purpose of these arrays is to provide a multiplicity of individual probes where each probe is located in a discrete spatially defined position. After hybridization of the sample is carried out, the particular amount of sample is measured for each site giving a relative measurement of how much material is present in the sample that has homology with the particular probe that is located at that site. The two most commonly used methods for array assembly operate on two very different scales for synthesis of arrays.
On the simplest level of construction, discrete nucleic acids are affixed to solid matrixes such as glass slides or nylon membranes in a process that is very similar to that employed by ink jet printers (for example, see Okamoto et al., 2000, Nature Biotechnology 18; 438-441). The nature of the probe deposited on the matrix can range from small synthetic oligonucleotides to large nucleic acid segments from clones. Preparation of a cloned segment to be used in this form of array assembly can range from lysing E. coli colonies containing individual clones and fixing them directly onto a matrix to, more elaborately, using individual plasmids as templates for preparation of PCR amplified material. The latter method is preferred due to the higher purity of the nucleic acid product. The choice of a particular probe to be used in the assembly can be directed in the sense that the function and sequence are known. This of course will always be true when oligonucleotides are used as the probes since they must be synthesized artificially. On the other hand, when the probes are derived from larger cloned segments of DNA, they can be used irrespective of knowledge of sequence or function. For instance, a bank of probes that represent the entire yeast genome was used in the studies cited earlier on differential expression during cell cycle progression. For human sequences, the burgeoning growth of the human sequencing project has provided a wealth of sequence information that is constantly expanding. Therefore, a popular source of probes that can be used to detect human transcripts has been Expressed Sequence Tags (ESTs) (Adams et al., 1991 Science 252; 1651-1656). The use of sequences of unknown function has the advantage of a lack of any a priori assumption concerning responsiveness in a comparative study and in fact, the study in itself may serve to identify functionality.
At present, filter and glass arrays are commercially available from a number of sources for the analysis of expression from various human tissues, developmental stages and disease conditions. On the other hand, directions for making custom arrays are widely disseminated throughout the literature and over the Internet.
At the other end of the scale in complexity is a process where in situ synthesis of oligonucleotides is carried out directly on a solid matrix using a “masking” technology that is similar to that employed in etching of microcircuits (Pirrung et al., U.S. Pat. No. 5,143,854, hereby incorporated by reference). Since this process can be carried out on a very small microscale, a very large number of different probes can be loaded onto a single “biochip” as a high density array. However, since this method depends upon site-specific synthesis, only oligonucleotides are used and the probes are necessarily of limited size. Also, since directed sequence synthesis is used, sequence information has to be available for each probe. An advantage of this system is that instead of a single probe for a particular gene product, a number of probes from different segments can be synthesized and incorporated into the design of the array. This provides a redundancy of information, establishing that changes in levels of a particular transcript are due to fluctuations in the intended target rather than by transcripts with one or more similar sequences. These “biochips” are commercially available as well as the hardware and software required to read them.
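The redundancy argument above can be illustrated with a short sketch in which several probes drawn from different segments of one transcript are combined into a single estimate, and any probe that disagrees strongly with the others is flagged as possibly reporting a different, similar transcript. The function name, the use of the median, the two-fold tolerance and the ratio values are assumptions made for illustration; they are not a description of how any commercial array software works.

```python
from statistics import median


def summarize_probe_set(ratios, tolerance=2.0):
    """Summarize treated/control ratios from several probes for one transcript.

    Returns the median ratio and the indices of probes whose ratio differs
    from the median by more than `tolerance`-fold (a hypothetical cutoff).
    """
    mid = median(ratios)
    discordant = [
        i for i, r in enumerate(ratios)
        if r > mid * tolerance or r < mid / tolerance
    ]
    return mid, discordant


# Hypothetical ratios from five probes covering different segments of one gene.
ratios = [4.1, 3.8, 4.4, 0.9, 4.0]
mid, discordant = summarize_probe_set(ratios)
```

Here four probes agree on roughly a four-fold change while the probe at index 3 does not, which is the kind of discrepancy a single-probe design could not reveal.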
Although solid supports such as plastic and glass have been commonly used for fixation of nucleic acids, porous materials have also been used. For example, oligonucleotides were joined to aldehyde groups in polyacrylamide (Yershov et al., (1996) Proc Nat. Acad. Sci USA 93; 4913-4918) and agarose (Afanassiev et al. (2000) Nucl. Acids Res. 28; e66) to synthesize arrays that were used in hybridization assays.
The second element involved in array analysis is the means by which the presence and amount of labeled nucleic acids bound to the various probes of the array will be detected. There are three levels of use of the target mRNA that can provide signal generation. In the first approach, the native RNA itself can be labeled. This has been carried out enzymatically by phosphorylation of fragmented RNA followed by T4 RNA ligase mediated addition of a biotinylated oligomer to the 5′ ends (Lockhart et al., 1996). This method has the limitation that it entails an overnight incubation to ensure adequate joining of labels to the RNA. For chemical labeling of RNA, the fragments can be labeled with psoralen that has been linked to biotin (Lockhart et al., 1996). This method has the disadvantage that the crosslinking that joins the label to the RNA can also lead to intrastrand crosslinking of target molecules, reducing the amount of hybridizable material.
In the second approach, rather than labeling the transcript itself, the RNA is used as a template to synthesize cDNA copies by the use of either random primers or oligo dT primers. Extension of the primers by reverse transcriptase can be carried out in the presence of modified nucleotides, thereby labeling all of the nascent cDNA copies. The modified nucleotides can have moieties attached that generate signals in themselves or they may have moieties suitable for attachment of other moieties capable of generation of signals. Examples of groups that have been used for direct signal generation are radioactive compounds and fluorescent compounds such as fluorescein, Texas red, Cy3 and Cy5. Direct signal generation has the advantage of simplicity but has the limitation that in many cases there is reduced efficiency for incorporation of the labeled nucleotides by a polymerase. Examples of groups that have been used for indirect signal generation in arrays are dinitrophenol (DNP) or biotin ligands. Their presence is detected later by the use of labeled molecules that have affinities for these ligands. Avidin or streptavidin specifically binds to biotin moieties, and antibodies can be used that are specific for DNP or biotin. These proteins can be labeled themselves or serve as targets for secondary bindings with labeled compounds. Alternatively, when the labeled nucleotides contain chemically active substituents such as allylamine modifications, post-synthetic modification can be carried out by a chemical addition of a suitably labeled ester.
The synthesis of a cDNA copy from an mRNA template essentially results in a one to one molar ratio of labeled product compared to starting material. In some cases there may be limiting amounts of the mRNA being analyzed and for these cases, some amplification of the nucleic acid sequences in the sample may be desirable. This has led to the use of the third approach, where the cDNA copy derived from the original mRNA template is in itself used as a template for further synthesis. A system termed “Transcription Amplification System” (TAS) was described (Kwoh, D. Y. and Gingeras, T. R., 1989, Proc. Nat. Acad. Sci., 86, 1173-1177) in which a target specific oligonucleotide is used to generate a cDNA copy and a second target specific oligonucleotide is used to convert the single stranded DNA into double-stranded form. By inclusion of a T7 promoter sequence into the first oligonucleotide, the double-stranded molecule can be used to make multiple transcription products that are complementary to the original mRNA of interest. The purpose of this system was for amplification of a discrete sequence from a pool of various RNA species. No suggestion or appreciation of such a system for the use of non-discrete primer sequences for general amplification was described in this work.
The production of multiple RNA transcript copies homologous to the original RNA population has been disclosed by van Gelder et al. in U.S. Pat. No. 5,891,636, where specific reference is given to the utility of such a system for creating a library of various gene products in addition to discrete sequences. Since each individual mRNA molecule has the potential for ultimately being the source of a large number of complementary transcripts, this system enjoys the advantages of linear amplification such that smaller amounts of starting material are necessary compared to direct labeling of the original mRNA or its cDNA copy.
However, the work described in U.S. Pat. No. 5,891,636 specifically teaches away from addition of exogenous primers for synthesis of a second strand. Instead, it discloses the use of oligonucleotide primers for production of only the first strand of cDNA. For synthesis of the second strand, two possible methods were disclosed. In the first method, the nicking activity of RNase H on the original mRNA template was used to create primers that could use the cDNA as a template. In the second method, DNA polymerase was added to form hairpins at the end of the first cDNA strand that could provide self-priming. The first method has the limitation that RNase H has to be added after the completion of the cDNA synthesis reaction and a balance of RNase H activity has to be determined to provide sufficient nicking without total degradation of potential RNA primers. The second method requires an extra step of incubation with a different polymerase besides the Reverse Transcriptase, and S1 nuclease also has to be added to eliminate the loop in the hairpin structure. In addition, the formation and extension by foldback is a poorly understood system that does not operate at high efficiency where sequences and amounts of cDNA copies may act as random factors.
In addition to the amplification provided by the use of RNA transcription, PCR has been included in some protocols to carry out synthesis of a library through the use of common primer binding sites at each end of individual sequences (Endege et al., 1999 Biotechniques 26; 542-550, Ying et al., 1999 Biotechniques 27; 410-414). These methods share the necessity for a machine dedicated to thermal cycling.
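The difference in scale between the three routes discussed above (a one to one cDNA copy, linear amplification by transcription, and exponential amplification by PCR) can be made concrete with some idealized arithmetic. This is a sketch under stated assumptions: the function names, the number of transcripts per template, the cycle count and the per-cycle efficiency are hypothetical values chosen for illustration, and real reactions plateau and vary between sequences.

```python
def cdna_yield(templates):
    """cDNA synthesis: roughly one labeled copy per starting mRNA molecule."""
    return templates


def transcription_yield(templates, transcripts_per_template):
    """Linear amplification: each promoter-bearing template is transcribed repeatedly."""
    return templates * transcripts_per_template


def pcr_yield(templates, cycles, efficiency=1.0):
    """Exponential amplification: copies grow by (1 + efficiency) per thermal cycle."""
    return templates * (1.0 + efficiency) ** cycles


# Hypothetical starting amount and reaction parameters (assumed, not measured).
start = 1000
copies_cdna = cdna_yield(start)
copies_linear = transcription_yield(start, transcripts_per_template=100)
copies_pcr = pcr_yield(start, cycles=10)

# With unequal per-cycle efficiencies, PCR distorts the ratio between two species
# that started at 10:1; linear amplification with equal yields would preserve it.
ratio_after_pcr = pcr_yield(10000, 10, efficiency=1.0) / pcr_yield(1000, 10, efficiency=0.9)
```

Under these assumptions the 10:1 starting ratio drifts to roughly 17:1 after ten cycles, which illustrates why relative abundance is easier to preserve with linear amplification.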
In addition to binding analytes from a library, the nucleic acids on an array can use the analytes as templates for primer extension reactions. For instance, determination of Single Nucleotide Polymorphisms (SNPs) has been carried out by the use of a set of primers at different sites on the array that exhibit sequence variations from each other (Pastinen et al., 2000, Genome Research 10; 1031-1042). The ability or inability of a template to be used for primer extension by each set of primers is an indication of the particular sequence variations within the analytes. More complex series of reactions have also been carried out by the use of arrays as platforms for localized amplification as described in U.S. Pat. No. 5,641,658 and Westin et al., 2000, Nature Biotechnology 18; 199-204. In these particular applications of array technology, PCR and SDA were carried out by providing a pair of unique primers for each individual nucleic acid target at each locus of the array. The presence or absence of amplification at each locus of the array served as an indicator of the presence or absence of the corresponding target sequences in the analyte samples.
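The primer extension logic described above for SNP determination can be sketched as follows. The model is deliberately simplified and is an assumption of this example: a primer is treated as extendable only when it is perfectly complementary to the template, including its 3′ terminal base, whereas real polymerases show graded discrimination against mismatches. The sequences, function names and allele labels are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def can_extend(primer, template):
    """Simplified rule: extension requires a perfect match, 3' end included."""
    return reverse_complement(primer) in template


def call_alleles(template_strands, primers_by_allele):
    """Return the alleles whose primer is extended on at least one template strand."""
    return sorted(
        allele
        for allele, primer in primers_by_allele.items()
        if any(can_extend(primer, strand) for strand in template_strands)
    )


# Hypothetical templates differing at one position (G versus A).
template_g = "CCTGAAGTCATCGGA"
template_a = "CCTAAAGTCATCGGA"

# Primers located at different array sites; they differ only at the 3' terminal base.
primers = {"G": "CGATGACTTC", "A": "CGATGACTTT"}

homozygous = call_alleles([template_g], primers)
heterozygous = call_alleles([template_g, template_a], primers)
```

Each array site thus reports one variant: a sample containing only the G template extends only the first primer, while a mixed sample extends both.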
Despite the accelerated development of the synthesis and use of DNA microarrays in recent years, the progress in the development of arrays of proteins or other ligands has been significantly slower even though such arrays are an ideal format with which to study gene expression, as well as antibody-antigen, receptor-ligand, protein-protein interactions and other applications. In previous art, protein arrays have been used for gene expression, antibody screening, and enzymatic assays (Lueking et al. (1999) Anal. Biochem. 270; 103-111; de Wildt et al., (2000) Nature Biotechnology 18; 989-994, Arenkov et al., (2000) Analytical Biochemistry 278; 123-131). Protein arrays have also been used for high throughput ELISA assays (Mendoza et al., (1999) Biotechniques 27; 778-788) and for the detection of individual proteins in complex solutions (Haab, et al.; (2001) Genome Biology 2; 1-13). However, the use thus far has been limited because of the inherent problems associated with proteins. DNA is extremely robust and can be immobilized on a solid matrix, dried and rehydrated without any loss of activity or function. Proteins, however, are far more difficult to utilize in array formats. One of the main problems of using proteins in an array format is the difficulty of applying the protein to a solid matrix in a form that would allow the protein to be accessible and reactive without denaturing or otherwise altering the peptide or protein. Also, many proteins cannot be dehydrated and must be kept in solution at all times, creating further difficulties for use in arrays.
Some methods which have been used to prepare protein arrays include placing the proteins on a polyacrylamide gel matrix on a glass slide that has been activated by treatment with glutaraldehyde or other reagents (Arenkov, op. cit.). Another method has been the addition of proteins to aldehyde coated glass slides, followed by blocking of the remaining aldehyde sites with BSA after the attachment of the desired protein. This method, however, could not be used for small proteins because the BSA obscured the protein. Peptides and small proteins have been placed on slides by coating the slides with BSA and then activating the BSA with N,N′-disuccinimidyl carbonate (Taton et al., (2000) Science 289, 1760-1763). The peptides were then printed onto the slides and the remaining activated sites were blocked with glycine. Protein arrays have also been prepared on poly-L-Lysine coated glass slides (Haab et al., op. cit.) and agarose coated glass slides (Afanassiev et al., (2000) Nucleic Acids Research 28, e66). “Protein Chips” are also commercially available from Ciphergen (Fremont, Calif.) for a process where proteins are captured onto solid surfaces and analyzed by mass spectroscopy.
The use of oligonucleotides as ‘hooks’ or ‘tags’ as identifiers for non-nucleic acid molecules has been described in the literature. For instance, a library of peptides has been made where each peptide is attached to a discrete nucleic acid portion and members of the library are tested for their ability to bind to a particular analyte. After isolation of the peptides that have binding affinities, identification was carried out by PCR to “decode” the peptide sequence (Brenner and Lerner, (1992) Proc. Nat. Acad. Sci. USA 89; 5381-5383, Needels et al., (1993) Proc. Nat. Acad. Sci. USA 90; 10,700-10,704). Nucleic acid sequences have also been used as tags in arrays where selected oligonucleotide sequences were added to primers used for single nucleotide polymorphism genotyping (Hirschhorn, et al., (2000) Proc. Natl. Acad. Sci. USA, 97; 12164-12169). However, in this case the ‘tag’ is actually part of the primer design and it is used specifically for SNP detection using a single base extension assay. A patent application filed by Lohse, et al., (WO 00/32823) has disclosed the use of DNA-protein fusions for protein arrays. In this method, the protein is synthesized from RNA transcripts which are then reverse transcribed to give the DNA sequences attached to the corresponding protein. This system lacks flexibility since the technology specifically relates only to chimeric molecules that comprise a nucleic acid and a peptide or protein. In addition, the protein is directly derived from the RNA sequence so that the resultant DNA sequence is also dictated by the protein sequence. Lastly, every protein that is to be used in an array requires the use of an in vitro translation system made from cell extracts, a costly and inefficient system for large scale synthesis of multiple probes.
The use of electrochemically addressed chips for use with chimeric compositions has also been described by Bazin and Livache 1999 in “Innovation and Perspectives in solid Phase Synthesis & Recombinatorial Libraries” R. Epton (Ed.) Mayflower Scientific Limited, Birmingham, UK.