Currently, sequencing projects, that is, the determination and analysis of the genomic DNA of various living organisms, are in progress all over the world. The whole genomic sequences of more than 40 species of prokaryotes, a lower eukaryote (yeast), a multicellular eukaryote (C. elegans), a higher plant (Arabidopsis), etc. have already been determined. For the human genome, which presumably has 3 billion base pairs, the analysis has advanced under a global cooperative organization, and a draft sequence was disclosed in 2001. Moreover, the complete structure is expected to be clarified and disclosed in 2002-2003. The aim of determining genomic sequences is to reveal the functions of all genes and their regulation, and to understand living organisms as a network of interactions between genes, proteins, cells, or individuals by deducing the information in a genome, which is the blueprint of a highly complicated living organism. Understanding living organisms by utilizing the genomic information from various species is not only important as an academic subject, but also socially significant from the viewpoint of industrial application.
However, determination of genomic sequences by itself cannot identify the functions of all genes. For example, in yeast, the functions of only approximately half of the 6000 genes predicted from the genomic sequence could be deduced. The human genome, on the other hand, has been estimated to contain about 30,000-40,000 genes. Further, 100,000 or more types of mRNAs are said to exist when variants produced by alternative splicing are taken into consideration. Therefore, it is desirable to establish “a high throughput analysis system of the gene functions” that makes it possible to identify, rapidly and efficiently, the functions of the vast number of genes obtained by genomic sequencing.
Many genes in the eukaryotic genome are split by introns into multiple exons. Thus, it is difficult to correctly predict the structure of the encoded protein based solely on genomic information. In contrast, cDNA, which is produced from mRNA that lacks introns, encodes a protein as a single continuous amino acid sequence and makes it easy to identify the primary structure of the protein. In human cDNA research, more than three million ESTs (Expressed Sequence Tags) are publicly available to date, and the ESTs presumably cover not less than 80% of all human genes.
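The relationship between a split gene and its intron-free transcript can be illustrated with a minimal sketch. The sequence, the exon coordinates, and the function name `splice_exons` are hypothetical and chosen only for illustration; in practice the exon boundaries are exactly what is difficult to predict from genomic sequence alone.

```python
def splice_exons(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) into one intron-free sequence."""
    return "".join(genomic[start:end] for start, end in sorted(exons))


# Toy genomic sequence: exons in uppercase, introns in lowercase (made-up data).
genomic = "ATGGCC" + "gtaagtttcag" + "AAATTT" + "gtcccag" + "GGGTAA"

# Hypothetical exon coordinates; with real genomic data these must be predicted
# or determined experimentally, for example by comparison with a cDNA.
exons = [(0, 6), (17, 23), (30, 36)]

mature = splice_exons(genomic, exons)
print(mature)  # ATGGCCAAATTTGGGTAA
```

The joined sequence is continuous, which is why a cDNA can be read directly as a protein-coding sequence once its reading frame is known.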
EST information is utilized for analyzing the structure of the human genome, and for predicting the exon regions of genomic sequences or their expression profiles. However, many human ESTs have been derived from regions proximal to the 3′-end of cDNA, and information around the 5′-end of mRNA is extremely limited. Among human cDNAs, the number of corresponding mRNAs for which the full-length encoded protein sequence has been deduced is approximately 13,000.
Based on the 5′-end sequence of a full-length cDNA, it is possible to identify the transcription start site of the mRNA on the genomic sequence, and to analyze factors involved in the stability of the mRNA represented by the cDNA or in the regulation of its expression at the translation stage. Also, since a full-length cDNA contains the ATG codon, the translation start site, in its 5′-region, it can be translated into a protein in the correct frame. Therefore, by utilizing an appropriate expression system, it is possible to produce a large amount of the protein encoded by the cDNA or to analyze the biological activity of the expressed protein. Thus, analysis of a full-length cDNA provides valuable information that complements the information from genome sequencing. Also, full-length cDNA clones that can be expressed are extremely valuable in the empirical analysis of gene function and in industrial application.
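As a rough illustration of why the ATG codon fixes the reading frame, the following sketch locates the first ATG in a cDNA sequence and translates from there to the first stop codon using the standard genetic code. The function name `translate_from_first_atg` and the example sequence are hypothetical, and treating the first ATG as the start site is a simplifying assumption, not a general rule for real transcripts.

```python
# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_from_first_atg(cdna: str) -> str:
    """Translate from the first ATG up to (not including) the first in-frame stop."""
    seq = cdna.upper()
    start = seq.find("ATG")
    if start == -1:
        return ""  # no start codon: the frame cannot be fixed this way
    protein = []
    for i in range(start, len(seq) - 2, 3):
        residue = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" marks ambiguous codons
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Made-up cDNA: short 5' untranslated region, coding region, short 3' region.
print(translate_from_first_atg("GGCACCATGGCCAAATTTGGGTAACTGA"))  # MAKFG
```

A cDNA truncated at the 5′-end would lack the true ATG, so the same procedure would either return nothing or start from an internal ATG and yield a truncated protein.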
Therefore, if a novel human full-length cDNA is isolated, it can be used for developing medicines for diseases in which the gene is involved. The protein encoded by the gene can itself be used as a drug. Thus, obtaining a full-length cDNA encoding a novel human protein has great significance.
In particular, human secretory proteins or membrane proteins would be useful as medicines themselves, like tissue plasminogen activator (TPA), or as targets of medicines, like membrane receptors. In addition, genes for signal transduction-related proteins (protein kinases, etc.), glycoprotein-related proteins, transcription-related proteins, etc. are genes whose relationships to human diseases have been elucidated. Moreover, genes for disease-related proteins form a gene group rich in genes whose relationships to human diseases have been elucidated.
Therefore, it has great significance to isolate novel human full-length cDNA clones, only a few of which have been isolated. In particular, isolation of a novel cDNA clone encoding a secretory protein or membrane protein is desired, since the protein itself would be useful as a medicine and the clones potentially include a gene involved in diseases. In addition, genes encoding proteins that are involved in signal transduction, glycoproteins, transcription, or diseases are expected to be useful as target molecules for therapy, or as medicines themselves. These genes form a gene group predicted to be strongly involved in diseases. Thus, identification of the full-length cDNA clones encoding those proteins has great significance.