1. Field of the Invention
The present invention relates generally to the field of genome-wide gene analysis. More particularly, it concerns the development of a technique in which longer sequences extended from SAGE tags are generated to analyze gene expression. It further concerns the development of a technique in which extended DNA sequences encoding parts of an isolated protein fragment are generated to identify the genes encoding isolated proteins. The invention also provides a high-throughput method for identifying the genes corresponding to SAGE tags.
2. Description of Related Art
A particular biological event in a cell is largely controlled by the expression of multiple genes, each at the correct time and in a spatially appropriate manner. Monitoring the pattern of gene expression under various physiological and pathological conditions is a critical step in understanding these biological processes and in developing potential interventions. Because of the large number of genes expressed from higher eukaryotic genomes, powerful tools are needed to characterize the overall pattern of gene expression. The successful development of the SAGE technique (Serial Analysis of Gene Expression) is an important milestone in this regard (Velculescu et al., 1995). In the SAGE technique, a short sequence tag of 10 nucleotides representing each expressed sequence is excised, and the tags from different expressed sequences are ligated together for sequencing analysis. This strategy provides maximal coverage of the expressed genes for gene identification at the whole genome level while keeping the sequencing analysis at a manageable scale. Application of the SAGE technique has provided valuable information in various biological systems (Zhang et al., 1997; Velculescu et al., 1997; Madden et al., 1997; Hibi et al., 1998; Hashimoto et al., 1999).
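The tag-based counting described above can be illustrated with a minimal sketch. It is not the laboratory procedure itself. It assumes, purely for illustration, that the tag is the 10 nucleotides immediately 3' of the last CATG site in each transcript; the anchor site, the function names and the toy transcripts are assumptions that do not appear in the text above.

```python
from collections import Counter

ANCHOR = "CATG"  # assumed anchoring site; not stated in the text above
TAG_LEN = 10     # tag length given in the text

def extract_tag(transcript, anchor=ANCHOR, tag_len=TAG_LEN):
    """Return the tag_len bases following the 3'-most anchor site, or None."""
    pos = transcript.rfind(anchor)
    if pos == -1:
        return None  # no anchor site, so no tag is produced
    start = pos + len(anchor)
    tag = transcript[start:start + tag_len]
    # Discard tags that run off the 3' end of the transcript.
    return tag if len(tag) == tag_len else None

def tag_counts(transcripts):
    """Count how often each tag occurs across a collection of transcripts."""
    tags = (extract_tag(t) for t in transcripts)
    return Counter(t for t in tags if t is not None)

# Hypothetical transcripts: two copies of one mRNA, one copy of another,
# and one sequence lacking the anchor site.
transcripts = [
    "GGGCATGAAAACCCCGGTTTT",
    "GGGCATGAAAACCCCGGTTTT",
    "CCCCATGTTTTGGGGCCAAAA",
    "ACGTACGT",
]
counts = tag_counts(transcripts)

# Number of distinct sequences a 10-nucleotide tag can take.
possible_tags = 4 ** TAG_LEN
```

In this sketch the count of each tag stands in for the abundance of the corresponding transcript, which is why sequencing only the tags keeps the analysis at a manageable scale.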
However, two problems arise when SAGE tag sequences are used for gene identification. The first is that many of the SAGE tags identified have no match to known sequences in the databases (Zhang et al., 1997; Velculescu et al., 1997). These tags may represent novel genes. Because the tags are so short, however, it is difficult to use this tag information for further characterization of the corresponding genes. The second problem is that many SAGE tag sequences match multiple sequences in the databases. These matched sequences have no similarity to each other except that they share the same SAGE tag sequence. It is therefore difficult to determine which of the matched sequences is the one that actually corresponds to a SAGE tag in a particular tissue.
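Both problems can be illustrated with a short sketch that looks up 10-nucleotide tags in a toy sequence database. The database entries, gene names and function names are hypothetical, and matching a tag at any position in a sequence is a simplifying assumption rather than a description of any particular database search.

```python
from collections import defaultdict

def build_tag_index(database, tag_len=10):
    """Map every tag_len-mer to the set of database entries containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - tag_len + 1):
            index[seq[i:i + tag_len]].add(name)
    return index

def classify(tag, index):
    """Label a tag as 'no match', 'unique' or 'ambiguous' and list its hits."""
    hits = sorted(index.get(tag, ()))
    if not hits:
        return "no match", hits
    if len(hits) == 1:
        return "unique", hits
    return "ambiguous", hits

# Hypothetical database: geneA and geneB differ in sequence
# but share the same 10-nucleotide tag.
database = {
    "geneA": "GGGCATGAAAACCCCGGTTTT",
    "geneB": "ACACACATGAAAACCCCGGCTCTCT",
    "geneC": "CCCCATGTTTTGGGGCCAAAA",
}
index = build_tag_index(database)

unmatched = classify("ACGTACGTAC", index)  # first problem: no known sequence
ambiguous = classify("AAAACCCCGG", index)  # second problem: several sequences
unique = classify("TTTTGGGGCC", index)     # the unproblematic case
```

A tag in the first category gives no further sequence to work with, and a tag in the second cannot by itself indicate which of the matched entries is expressed in the tissue under study.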