Genome sequencing projects, which determine and analyze the genomic DNA of various living organisms, are currently in progress all over the world. The whole genomic sequences of more than 10 species of prokaryotes, of a lower eukaryote (yeast), and of a multicellular eukaryote (C. elegans) have already been determined. The human genome, which is thought to comprise approximately three billion base pairs, is being analyzed through worldwide cooperative projects, and its whole structure is predicted to be determined by the years 2002-2003. The aim of determining genomic sequences is to reveal the functions of all genes and their regulation, and to understand living organisms as networks of interactions between genes, proteins, cells, or individuals by deducing the information in a genome, which is viewed as a blueprint of a highly complicated living organism. Understanding living organisms by utilizing the genomic information from various species is not only important as an academic subject but also socially significant from the viewpoint of industrial application.
However, determination of genomic sequences by itself cannot identify the functions of all genes. For example, in yeast, functions have been deduced for only approximately half of the 6000 genes predicted from the genomic sequence. In humans, the number of genes is predicted to be approximately one hundred thousand. Therefore, it is desirable to establish “a high throughput analysis system of gene functions” that allows rapid and efficient identification of the functions of the vast number of genes obtained by genomic sequencing.
Many genes in the eukaryotic genome are split by introns into multiple exons. Thus, it is difficult to correctly predict the structure of the encoded proteins based solely on genomic information. In contrast, cDNA, which is produced from mRNA that lacks introns, encodes the amino acid sequence of a protein as a single continuous sequence, so the primary structure of the protein can be identified easily. In human cDNA research, more than one million ESTs (Expressed Sequence Tags) are available to date in public databases, and these ESTs presumably cover at least 80% of all human genes.
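The contrast between a split genomic gene and its continuous cDNA can be illustrated with a minimal Python sketch. The toy sequence, the exon coordinates, and the function name `splice_exons` are hypothetical and chosen only for illustration; in practice the exon boundaries are exactly what is hard to predict from genomic sequence alone.

```python
def splice_exons(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) into one continuous sequence.

    This mimics, in a highly simplified way, how the intron-free sequence
    represented by a cDNA relates to the genomic sequence. It assumes the
    exon coordinates are already known.
    """
    return "".join(genomic[start:end] for start, end in exons)


# Hypothetical toy gene: two exons separated by one intron (lowercase).
exon1 = "ATGGCT"
intron = "gtaagttttcag"  # toy intron beginning with "gt" and ending with "ag"
exon2 = "TGGTAA"
genomic = exon1 + intron + exon2

# Exon coordinates on the genomic sequence (assumed known for this sketch).
exons = [(0, 6), (18, 24)]

print(splice_exons(genomic, exons))
```

Reading the genomic sequence straight through would carry the intron into the reading frame, whereas the spliced sequence keeps the two exons contiguous.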
EST information is utilized for analyzing the structure of the human genome, and for predicting the exon regions of genomic sequences or their expression profiles. However, many human ESTs have been derived from regions proximal to the 3′-end of cDNA, and very little information is available about the 5′-end of mRNA. Among these human cDNAs, the number of corresponding mRNAs whose encoded protein sequences have been deduced is approximately 7000, and the number of full-length clones is only 5500. Thus, even including cDNAs registered as ESTs, the human cDNAs obtained so far are estimated to represent only 10–15% of all genes.
Based on the 5′-end sequence of a full-length cDNA, it is possible to identify the transcription start site of the mRNA on the genomic sequence, and to analyze factors involved in the stability of the mRNA corresponding to the cDNA or in the regulation of its expression at the translation stage. Also, since a full-length cDNA contains the ATG translation start site in its 5′-region, it can be translated into a protein in the correct frame. Therefore, by utilizing an appropriate expression system, it is possible to produce a large amount of the protein encoded by the cDNA or to analyze the biological activity of the expressed protein. Thus, analysis of a full-length cDNA provides valuable information that complements the information from genome sequencing. In addition, full-length cDNA clones that can be expressed are extremely valuable in the empirical analysis of gene function and in industrial application.
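The importance of the ATG start site for in-frame translation can be sketched in Python as follows. This is a simplified illustration, not a description of any particular analysis pipeline: it assumes that translation starts at the first ATG in the sequence and uses the standard genetic code, whereas real start-site selection also depends on sequence context. The function name `translate_from_first_atg` and the toy sequences are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES), AMINO_ACIDS
    )
}


def translate_from_first_atg(cdna):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns an empty string if the sequence contains no ATG, as may happen
    with a clone that is truncated at its 5'-end.
    """
    seq = cdna.upper()
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical full-length clone: 5'-UTR, ATG, coding region, stop codon.
full_length = "GGCACCATGGCTTGGTAAGC"
# The same clone truncated at the 5'-end, so the ATG is missing.
truncated = "GCTTGGTAAGC"

print(translate_from_first_atg(full_length))  # MAW
print(repr(translate_from_first_atg(truncated)))  # ''
```

In this toy case the truncated clone yields no product because its start codon is absent, which reflects why clones lacking the 5′-region cannot simply be expressed in the correct frame.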
In particular, human secretory proteins or membrane proteins would be useful themselves as medicines, like tissue plasminogen activator (TPA), or as targets of medicines, like membrane receptors.
Therefore, it is of great significance to isolate novel full-length human cDNA clones, of which only a few have been isolated. In particular, isolation of a novel cDNA clone encoding a secretory protein or membrane protein is desired, since the protein itself, or a molecule that interacts with the membrane protein, would be useful as a medicine, and since such clones potentially include genes associated with diseases. Thus, identification of the full-length cDNA clones encoding those proteins is of great significance.