The information content of the genome is carried as deoxyribonucleic acid (DNA). The size and composition of a given genomic sequence determine the form and function of the resultant organism. In general, genomic complexity is proportional to the complexity of the organism. Relatively simple organisms such as bacteria have genomes of about 1-5 megabases, while mammalian genomes are approximately 3000 megabases. The genome is generally divided into distinct segments known as chromosomes. The bacterium Escherichia coli (E. coli) contains a single circular chromosome, whereas the human genome consists of 24 distinct chromosomes.
Genomic DNA exists as a double-stranded polymer containing four DNA bases (A, G, C, and T) tethered to a sugar-phosphate backbone. The order of the bases along the DNA is the primary sequence of the DNA. The genome of an organism contains both protein coding and non-coding regions, including exons and introns, promoter and gene regulatory regions, and non-functional DNA. Genome analysis can provide a quantitative measure of gene copy number and chromosome number, as well as the presence of single base differences in the primary sequence of the DNA. Single base changes that are inherited are referred to as polymorphisms, whereas those that are acquired during the life of an organism are known as mutations. Genomic analysis at the DNA level does not provide a measure of gene expression (that is, the process by which RNA and protein copies of the coding sequences are synthesized).
All of the cells from a given organism are assumed to contain identical genomes, while genomes from different individuals of the same species are typically about 99.9% identical. The 0.1% polymorphism rate among individuals (Wang et al., Science 280: 1077 (1998)) is significant in that approximately three million polymorphisms are expected to be found upon complete sequencing of any two human genomes. If single base changes occur in protein coding segments, polymorphisms can alter the protein sequence and therefore change the biochemical activity of the protein.
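The three-million figure follows from simple arithmetic. The sketch below is illustrative only: it assumes a human genome of roughly 3000 megabases (the approximate mammalian genome size stated earlier) and applies the 0.1% polymorphism rate cited above. The function and constant names are hypothetical and chosen only for this example.

```python
def expected_polymorphisms(genome_size_bases: int, polymorphism_rate: float) -> int:
    """Estimate the number of single base differences between two genomes."""
    return round(genome_size_bases * polymorphism_rate)


# Assumption: ~3000 megabases, the approximate mammalian genome size.
HUMAN_GENOME_BASES = 3_000 * 1_000_000

# 0.1% of bases differ between two individuals (rate cited in the text).
POLYMORPHISM_RATE = 0.001

count = expected_polymorphisms(HUMAN_GENOME_BASES, POLYMORPHISM_RATE)
print(f"Expected polymorphisms between two genomes: {count:,}")
```

Under these assumptions the estimate is three million single base differences, which matches the figure quoted above.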
The DNA genome consists of discrete functional regions known as genes. Genomes of simple organisms such as bacteria contain approximately 1000 genes (Fleischmann et al., Science 269: 496 (1995)), whereas the human genome is estimated to contain about 100,000 genes (Fields et al., Nature Genet. 7: 345 (1994)). Genomic analysis at the mRNA level can be used as a measure of gene expression. Expression levels for each gene are determined by a combination of genetic and environmental factors. The genetic factors include the precise DNA sequence of gene regulatory regions such as promoters, enhancers, and splice sites. Polymorphisms in the DNA are thus expected to contribute to some of the differences in gene expression among individuals of the same species. Expression levels are also affected by environmental factors, including temperature, stress, light, and signals that lead to changes in the levels of hormones and other signaling substances. For this reason, RNA analysis provides information not only about the genetic potential of an organism, but also about changes in functional state (M. Schena and R. W. Davis, DNA Microarrays: A Practical Approach (Oxford University Press, New York, 1999) 1-16).
The second step in gene expression is the synthesis of protein from mRNA. A unique protein is encoded by each mRNA, such that every three nucleotides of mRNA encode one amino acid of the polypeptide chain, with the linear order of the nucleotides represented as a linear sequence of amino acids. Once synthesized, the protein assumes a unique three-dimensional conformation that is determined largely by the primary amino acid sequence. Proteins impart the functional instructions of the genome by performing a wide range of biochemical activities including roles in gene regulation, metabolism, cell structure, and DNA replication.
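The triplet relationship between mRNA and protein described above can be illustrated with a minimal sketch. The codon table below is a small subset of the standard genetic code, included only for illustration; the function name and the example sequence are hypothetical.

```python
# Subset of the standard genetic code (mRNA codon -> amino acid).
# Only the codons used in this example, plus the three stop codons, are listed.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCA": "Ala", "AAA": "Lys",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(mrna: str) -> list[str]:
    """Read an mRNA three nucleotides at a time and return the amino acids.

    Translation halts at the first stop codon. Codons missing from the
    illustrative table raise KeyError rather than being guessed.
    """
    peptide = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3]
        residue = CODON_TABLE[codon]
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


# Hypothetical 18-nucleotide message: five sense codons followed by a stop codon.
peptide = translate("AUGUUUGGCGCAAAAUAA")
print("-".join(peptide))
```

Here the linear order of the codons is preserved in the linear order of the amino acids, so the example message yields a five-residue chain beginning with methionine.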
Individuals in a population may have differences in protein activity due to polymorphisms that either alter the primary amino acid sequence of the proteins or perturb steady state protein levels by altering gene expression. Similar to mRNA levels, protein levels can also change in response to changes in the environment; moreover, protein levels are also subject to translational and post-translational controls, which do not affect mRNA levels directly (Schena and Davis, 1999). Proteomic analysis provides data on whether and when a predicted gene product is actually translated, the level and type of post-translational modification it may undergo, and its relative concentration compared with other proteins (Humphrey-Smith and Blackstock, J. Protein. Chem. 16: 537-544 (1997)). After DNA is transcribed into mRNA, the exons may be spliced in different ways before being translated into proteins. Following the translation of mRNA by ribosomes, proteins are usually post-translationally modified by the addition of different chemical groups such as carbohydrate, lipid and phosphate groups, as well as through the proteolytic cleavage of specific peptide bonds. These chemical modifications are crucial to modulating protein function but are not directly coded for by genes. Furthermore, both mRNA and protein are continually being synthesized and degraded, and thus final levels of protein are not easily obtainable by measuring mRNA levels (Patton, J. Chromatogr. 722: 203-223 (1999); Patton et al., J. Biol. Chem. 270: 21404-21410 (1995)). So while mRNA levels are often extrapolated to indicate the levels of expressed proteins, it is not surprising that there is little correlation between the abundance of mRNA species and the actual amounts of the proteins that they encode (Anderson and Seilhamer, Electrophoresis 18: 533-537; Gygi et al., Mol. Cell. Biol. 19: 1720-1730 (1999)).
A growing body of evidence suggests that changes in gene and protein expression may correlate with the onset of a given human disease (Schena and Davis, 1999). Proteomic analysis of disease tissues should allow the identification of proteins whose expression is altered in a given illness. Many small molecules may also alter protein expression at a global level. Combining information about altered expression in a disease state with the changes that result from treatment with a small molecule would provide valuable information about classes of molecules that may be effective in combating a given disease. Proteomics thus has a role in processes such as lead compound screening and optimization, toxicity, pharmacodynamics, and drug efficacy.
A pivotal component of proteomics is its ability to quantify vast numbers of proteins accurately and reproducibly. Typically, proteomics entails the simultaneous separation of proteins from a biological sample, and the quantitation of the relative abundance of the proteins resolved during the separation. Proteomics currently relies heavily on two-dimensional (2-D) gel electrophoresis. However, obtaining information concerning global protein expression using 2-D gels is technically difficult, and semiautomated procedures to carry out this process are in their infancy (Patton, Biotechniques 28: 944-957 (2000)). Furthermore, the commonly used stains for evaluating protein expression in 2-D gels (such as Coomassie Blue, colloidal gold and silver stain) do not provide the requisite dynamic range to be effective in this capacity. These stains are linear over only a 10- to 40-fold range, whereas the abundance of individual proteins differs by as much as four orders of magnitude (Brush, The Scientist 12: 16-22 (1998); Wirth and Romano, J. Chromatogr. 698: 123-143 (1995)). In addition, low abundance proteins, such as transcription factors and kinases that are present in 1-2000 copies per cell, often represent species that perform important regulatory functions. The accurate detection of such low-abundance proteins is an important challenge to proteomics. Methods have recently been introduced to directly quantify the relative abundance of proteins in two different samples by mass spectrometry. However, the linear dynamic range of these methods has been demonstrated over only a four- to ten-fold range (Gygi et al. 1999; Oda et al., Proc. Natl. Acad. Sci. USA 96: 6591-6596 (1999)).
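The mismatch between detection range and protein abundance range can be made concrete with a short calculation. This is an illustrative sketch only: it takes "four orders of magnitude" to mean a 10,000-fold abundance range and uses the 10- to 40-fold linear ranges quoted above for stains. The function and constant names are hypothetical.

```python
import math


def orders_of_magnitude(fold_range: float) -> float:
    """Express a fold range on a log10 scale."""
    return math.log10(fold_range)


def coverage_gap(abundance_fold: float, linear_fold: float) -> float:
    """Factor by which the abundance range exceeds the linear detection range."""
    return abundance_fold / linear_fold


# Assumption: "four orders of magnitude" is read as a 10,000-fold range.
PROTEIN_ABUNDANCE_FOLD = 10_000

# Linear ranges quoted in the text for commonly used 2-D gel stains.
STAIN_LINEAR_FOLDS = (10, 40)

for linear_fold in STAIN_LINEAR_FOLDS:
    gap = coverage_gap(PROTEIN_ABUNDANCE_FOLD, linear_fold)
    print(
        f"{linear_fold}-fold stain covers {orders_of_magnitude(linear_fold):.1f} "
        f"of {orders_of_magnitude(PROTEIN_ABUNDANCE_FOLD):.1f} orders of magnitude; "
        f"abundance range exceeds it by {gap:.0f}-fold"
    )
```

Under these assumptions, even a stain with a 40-fold linear range falls short of the abundance range by a factor of 250, and a 10-fold stain by a factor of 1000.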
It has been noted that developing microarray technologies would make possible the simultaneous, ultra-sensitive measurement of hundreds or even thousands of substances in a small sample (Ekins, Clin. Chem. 44: 2015-2030 (1998)). This approach has been difficult to put into practice, however, because the extremely small volumes (about 0.5-5 nl) of sample used to create spots on these microarrays make it necessary to utilize methods of analyte detection that are extremely sensitive. Rolling Circle Amplification (RCA) driven by DNA polymerase can replicate circular oligonucleotide probes with either linear or geometric kinetics under isothermal conditions (Lizardi et al., Nature Genet. 19: 225-232 (1998)). If a single primer is used, RCA generates in a few minutes a linear chain of hundreds or thousands of tandemly linked DNA copies of a target, which is covalently linked to that target. Generation of a linear amplification product permits both spatial resolution and accurate quantitation of a target. DNA generated by RCA can be labeled with fluorescent oligonucleotide tags that hybridize at multiple sites in the tandem DNA sequences. RCA can be used with fluorophore combinations designed for multiparametric color coding (Speicher et al., Nature Genet. 12: 368-375 (1996)), thereby markedly increasing the number of targets that can be analyzed simultaneously. RCA technologies can be used in solution, in situ and in microarrays. In solid phase formats, detection and quantitation can be achieved at the level of single molecules (Lizardi et al., 1998).