The ultimate goal of the Human Genome Project is to sequence the entire human genome. The expected outcome of this effort is a precise map of the 70,000–100,000 genes that are expressed in man. However, a fairly complete inventory of human coding sequences will most likely be publicly available sooner. Since the early 1980s, a large number of Expressed Sequence Tags (ESTs), which are partial DNA sequences read from the ends of complementary DNA (cDNA) molecules, have been obtained by both government and private research organizations. A hallmark of these endeavors, carried out by a collaboration between Washington University Genome Sequencing Center and members of the IMAGE (Integrated Molecular Analysis of Gene Expression) consortium (http://www-bio.llnl.gov/bbrp/image/image.html), has been the rapid deposition of the sequences into the public domain and the concomitant availability of the sequence-tagged cDNA clones from several distributors (Marra, et al. (1998) Trends Genet. 14 (1):4–7). At present, the collection of cDNAs is believed to represent approximately 50,000 different human genes expressed in a variety of tissues including liver, brain, spleen, B-cells, kidney, muscle, heart, alimentary tract, retina, and hypothalamus, and the number is growing daily.
Recent initiatives like that of the Cancer Genome Anatomy Project support an effort to obtain full-length sequences of clones in the UniGene set (a set of cDNA clones that is publicly available) by the year 1999. At the same time, commercial entities propose to validate 40,000 full-length cDNA clones by 1999. These individual clones will then be available to any interested party. The speed with which the coding sequences of novel genes are identified is in sharp contrast to the rate at which the function of these genes is elucidated. Assigning functions to the cDNAs in the databases, or functional genomics, is a major challenge in biotechnology today.
For decades, novel genes were identified as a result of research designed to explain a biological process or hereditary disease, and knowledge of the function of the gene preceded its identification. In functional genomics, coding sequences of genes are first cloned and sequenced, and the sequences are then used to find functions. Although other organisms such as Drosophila, C. elegans, and zebrafish are highly useful for the analysis of fundamental genes, animal model systems are indispensable for complex mammalian physiological traits (blood glucose, cardiovascular disease, inflammation). However, the slow rate of reproduction and the high housing costs of the animal models are a major limitation to high-throughput functional analysis of genes. Although labor-intensive efforts are made to establish libraries of mouse strains with chemically or genetically mutated genes in a search for phenotypes that allow the elucidation of gene function or that are related to human diseases, a systematic analysis of the complete spectrum of mammalian genes, be it human or animal, is a significant task.
In order to keep pace with the volume of sequence data, the field of functional genomics needs the ability to perform high-throughput analysis of true gene function. Recently, a number of techniques have been developed that are designed to link tissue- and cell-specific gene expression to gene function. These include cDNA microarray and gene chip technology and differential display of messenger RNA (mRNA). Serial Analysis of Gene Expression (SAGE) or differential display of mRNA can identify genes that are expressed in tumor tissue but are absent in the respective normal or healthy tissue. In this way, potential genes with regulatory functions can be separated from the excess of ubiquitously expressed genes that are less likely to be useful for small-molecule drug screening or gene therapy projects. Gene chip technology has the potential to allow the monitoring of gene expression by measuring the mRNA expression levels of a large number of genes in cells in only a few hours. Cells cultured under a variety of conditions can be analyzed for their mRNA expression patterns and compared. Currently, DNA microarray chips with 40,000 non-redundant human genes are produced and are planned to be on the market in 1999 (Editorial (1998) Nat. Genet. 18(3):195–7). However, these techniques are primarily designed for screening cancer cells and not for screening for specific gene functions.
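The tumor-versus-normal comparison described above can be illustrated with a minimal sketch. It assumes that expression has already been reduced to per-gene tag counts, as a SAGE-style experiment might report. The function name, the gene labels, the counts, and the `min_tumor_count` threshold are all hypothetical and chosen only for illustration; they are not taken from any of the cited methods.

```python
from typing import Dict, List


def tumor_specific_genes(
    tumor: Dict[str, int],
    normal: Dict[str, int],
    min_tumor_count: int = 5,
) -> List[str]:
    """Return genes detected in tumor tissue but absent from normal tissue.

    `tumor` and `normal` map a gene label to its tag count. A gene is
    reported when its tumor count reaches `min_tumor_count` (a hypothetical
    noise threshold) and its normal count is zero or missing.
    """
    hits = [
        gene
        for gene, count in tumor.items()
        if count >= min_tumor_count and normal.get(gene, 0) == 0
    ]
    return sorted(hits)


# Hypothetical tag counts; ACTB stands in for a ubiquitously expressed gene.
tumor_counts = {"GENE_A": 42, "GENE_B": 17, "ACTB": 310, "GENE_C": 2}
normal_counts = {"GENE_B": 15, "ACTB": 295}

candidates = tumor_specific_genes(tumor_counts, normal_counts)
print(candidates)
```

In this toy data set only GENE_A passes the filter: GENE_B and ACTB are also present in normal tissue, and GENE_C falls below the assumed noise threshold. As the paragraph above notes, such a list identifies where a gene is expressed, not what the gene does.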
Double or triple hybrid systems also are used to add functional data to the genomic databases. These techniques assay for protein-protein, protein-RNA, or protein-DNA interactions in yeast or mammalian cells (Brent and Finley (1997) Annu. Rev. Genet. 31:663–704). However, this technology does not provide a means to assay for a large number of other gene functions such as differentiation, motility, signal transduction, and enzyme and transport activity. Yeast expression systems have been developed which are used to screen for naturally secreted and membrane proteins of mammalian origin (Klein, et al. (1996) Proc. Natl. Acad. Sci. USA 93 (14):7108–13). This system also allows for collapsing of large libraries into libraries with certain characteristics that aid in the identification of specific genes and gene products. One disadvantage of this system is that genes encoding secreted proteins are primarily selected. A second disadvantage is that the library may be biased because the technology is based on yeast as a heterologous expression system and there will be gene products that are not appropriately folded.
Other current strategies include the creation of transgenic mice or knockout mice. A successful example of gene discovery by such an approach is the identification of the osteoprotegerin gene. DNA databases were screened to select ESTs with features suggesting that the cognate genes encoded secreted proteins. The biological functions of the genes were assessed by placing the corresponding full-length cDNAs under the control of a liver-specific promoter. Transgenic mice created with each of these constructs consequently have high plasma levels of the relevant protein. Subsequently, the transgenic animals were subjected to a battery of qualitative and quantitative phenotypic investigations. One of the genes introduced into mice produced animals with increased bone density, which subsequently led to the discovery of a potent anti-osteoporosis factor (Simonet, et al. (1997) Cell. 89(2):309–19). The disadvantages of this method are that it is costly and highly time-consuming.
The challenge in functional genomics is to develop and refine all the above-described techniques and to integrate their results with existing data in a well-developed database. Such a database should provide a picture of how gene function constitutes cellular metabolism and a means for this knowledge to be put to use in the development of novel medicinal products. The current technologies have limitations and do not necessarily result in true functional data. Therefore, there is a need for a method that allows for direct measurement of the function of a single gene from a collection of genes (gene pools or individual clones) in a high-throughput setting in appropriate in vitro assay systems and animal models.
The development of high-throughput screens is discussed in Jayawickreme and Kost, (1997) Curr. Opin. Biotechnol. 8:629–634. A high-throughput screen for rarely transcribed differentially expressed genes is described in von Stein et al., (1997) Nucleic Acids Res. 25:2598–2602. High-throughput genotyping is disclosed in Hall et al., (1996) Genome Res. 6:781–790. Methods for screening transdominant intracellular effector peptides and RNA molecules are disclosed in Nolan, WO97/27212 and WO97/27213.