Analyzing messenger RNA (“mRNA”) expressed in organisms is an important and useful approach to obtaining various biochemical findings, because the proteins of eukaryotes are generated by translation of mRNA. It is known that mRNA, which is transcribed from DNA, is reduced by a process known as splicing, that is, a number of cutting steps, to smaller mRNA before being translated into protein. As a result, a plurality of mRNAs may be generated by transcription and splicing from what was originally the same gene nucleotide sequence or region. Therefore, whether or not a gene is expressed in an organism can be determined by checking for the presence of only a single mRNA base sequence derived from that nucleotide sequence or DNA region. A cDNA library is a database of DNA sequences (hereinafter abbreviated to cDNA) obtained by reverse-transcribing mRNAs expressed in an organism with a reverse transcriptase, which reproduces them as DNA sequences corresponding to the original DNA, and sequencing the result. The cDNA database reflects the generation process of mRNA and therefore contains a number of cDNAs that are obtained from mRNAs derived from the same gene region of DNA and that have different base chain lengths. When the object is to determine whether or not the appropriate protein is expressed in each particular gene region as described above, conducting experiments on several cDNAs derived from the same region increases experiment costs and is therefore often undesirable. It is thus crucially important to accomplish accurate clustering that assembles base sequences obtained from cDNA derived from the same gene region into a single group. Such clustering can speed up and make more efficient the task of elucidating the function of a particular gene region, reduce experiment costs, and increase the search range.
Unfortunately, the above-described clustering involves enormous computational complexity and accordingly it is difficult to obtain significant results within a realistic time period. For example, a method known as spliced alignment has been used to determine whether or not two base sequences constitute a “spliced pair” generated by splicing. This method requires a significant expenditure of computational resources and therefore it is extremely difficult to carry out calculations on all the pairs contained in a typically massive set of sequences such as a cDNA library. A database called FANTOM, which is a mouse cDNA library, contains 21,076 base sequences. It would take more than 100 years for one typical computer to carry out calculations on all of the base sequences in the FANTOM mouse cDNA database. In order to solve the problem, various improvements on the spliced alignment have been considered.
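The scale of the problem can be illustrated with a short back-of-the-envelope calculation. The sketch below assumes that "calculations on all of the base sequences" means one spliced alignment per unordered pair of sequences, and it takes the "100 years" figure as a lower bound. The per-pair time it derives is only what those two figures imply together; it is not a measured value from any source.

```python
from math import comb

# Figure quoted in the text: size of the FANTOM mouse cDNA library.
n_sequences = 21_076

# Assumption: every unordered pair of sequences is aligned once.
n_pairs = comb(n_sequences, 2)

# Assumption: "more than 100 years" on one computer, taken as exactly 100 years.
seconds_per_year = 365 * 24 * 3600
implied_seconds_per_pair = 100 * seconds_per_year / n_pairs

print(f"pairs to align: {n_pairs:,}")
print(f"implied time per pair: {implied_seconds_per_pair:.1f} s")
```

Under these assumptions there are roughly 222 million pairs, so even a per-pair cost of a few seconds adds up to decades of computation, which is why candidate selection before alignment matters.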
Numerous approaches to improving the efficiency of the above-described clustering have been considered. For example, in “Optimal Spliced Alignment of Homologous cDNA to a Genomic DNA Template” (Jonathan Usuka, Wei Zhu and Volker Brendel, BIOINFORMATICS Vol. 16, No. 3, 2000, pp. 203-211), a hidden Markov model is used to model spliced alignment. Usuka et al. disclose a method for finding the regions that correspond to a cDNA in a text, which is a long sequence (the DNA of an organism). In particular, according to Usuka et al., a suffix array is used to select as candidates those regions of the text that share a 12-mer (a series of 12 bases) with the cDNA. It is noted that Usuka et al. do not explain why they used 12-mer base sequences and do not clarify whether the method can flexibly accommodate variations in the chain length of base sequences.
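The idea of selecting candidates by shared 12-mers with a suffix array can be sketched as follows. This is a minimal illustration, not the implementation of Usuka et al.: the function names are hypothetical, the suffix array is built by naive sorting (adequate only for short texts), and the example sequences are invented.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])

def shared_kmers(text, query, k=12):
    """Return (query_offset, text_position) for every k-mer of query found in text."""
    sa = build_suffix_array(text)
    # k-length prefixes of the sorted suffixes are themselves in sorted order,
    # so each k-mer can be located by binary search.
    prefixes = [text[i:i + k] for i in sa]
    hits = []
    for q in range(len(query) - k + 1):
        kmer = query[q:q + k]
        lo = bisect_left(prefixes, kmer)
        hi = bisect_right(prefixes, kmer)
        hits.extend((q, t) for t in sorted(sa[lo:hi]))
    return hits

# Invented example: the query occurs in the text starting at position 4.
text = "GGGGACGTTGCAATCGGATTCCCC"
query = "ACGTTGCAATCGGA"
shared = shared_kmers(text, query)
print(shared)
```

Regions of the text around the reported positions would then be passed to the more expensive alignment step. The fixed value k=12 mirrors the 12-mer mentioned above; the sketch does nothing to adapt k to sequence length.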
“A New Indexing Method for Approximate String Matching” (G. Navarro and R. Baeza-Yates, Proc. CPM99, LNCS1645, pp. 163-185, 1999) discloses an approximate pattern matching method in which an edit distance is defined and partial sequences whose edit distance from the pattern is less than or equal to a predetermined maximum allowable edit distance k are found in a text. Navarro et al. divide a sequence into d partial sequences, find in the text any partial sequence whose edit distance from one of these individual partial sequences is at most k/d, and treat it and its surroundings as candidates.
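The partitioning filter described above can be sketched as follows. The sketch relies on the standard pigeonhole argument: if a pattern matches with at most k errors and is cut into d pieces, at least one piece must match with at most floor(k/d) errors. It scans the text directly with a dynamic-programming search rather than using the index structure of Navarro et al., and all function names and example sequences are hypothetical.

```python
def split_pattern(pattern, d):
    # Cut the pattern into d pieces of (nearly) equal length.
    base, extra = divmod(len(pattern), d)
    pieces, start = [], 0
    for n in range(d):
        size = base + (1 if n < extra else 0)
        pieces.append(pattern[start:start + size])
        start += size
    return pieces

def approx_end_positions(piece, text, max_err):
    # Column-wise DP where a match may start anywhere in the text:
    # cur[i] = least edit distance between piece[:i] and some substring
    # of the text ending at the current character.
    prev = list(range(len(piece) + 1))
    ends = []
    for j, ch in enumerate(text):
        cur = [0]
        for i in range(1, len(piece) + 1):
            cost = 0 if piece[i - 1] == ch else 1
            cur.append(min(prev[i - 1] + cost, prev[i] + 1, cur[i - 1] + 1))
        if cur[-1] <= max_err:
            ends.append(j)
        prev = cur
    return ends

def candidate_ends(pattern, text, k, d):
    # For each piece, text positions where it ends with <= floor(k/d) errors.
    pieces = split_pattern(pattern, d)
    return {n: approx_end_positions(p, text, k // d) for n, p in enumerate(pieces)}

# Invented example: the text contains the pattern with one substitution.
pattern = "ACGTTGCAATCG"
text = "GGGGACGTTGCTATCGGG"
candidates = candidate_ends(pattern, text, k=2, d=3)
print(candidates)
```

Each reported position only marks a candidate region; the full pattern still has to be verified against the surrounding text with the full error budget k.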
In “EST_GENOME: A Program to Align Spliced DNA Sequences” (R. Mott, CABIOS, Vol. 13, No. 4, 1997, pp. 477-478), a Smith-Waterman dynamic programming algorithm is modified to impose a penalty on splice sites in such a way that splice sites shorter than the minimum allowable length of a splice site are excluded. While various other methods have been proposed, none of them is adequate for clustering base sequences while flexibly accommodating variations in the chain length of base sequences with reduced computation time and an acceptable amount of hardware resources.
While the prior-art approaches described above disclose clustering approaches, all of them perform the clustering by using criteria (such as conventional similarity) that do not take splicing into consideration, and none of them provides a clustering method that takes the relation between sequences before and after splicing into consideration.
Thus, there is a clear need for a technology that, before base sequence clustering, uses spliced alignment to select candidate base sequences quickly, efficiently, and with adequately high accuracy, without omissions. Although there are various prior-art approaches as described above, there remains a need for a cluster generating system, a method for enabling base sequence clustering, a program for performing the method, and a computer-readable storage medium containing the program that can associate base sequences held in a database, such as a cDNA database, with base sequences that are likely to be generated by splicing from the stored cDNA in order to generate clusters quickly, thereby conserving calculation time and hardware resources. There is also a need for a cluster generating system, a method for enabling base sequence clustering, a program for performing the method, and a computer-readable storage medium containing the program that allow a user to generate clusters in a limited time period and within reasonable, that is, limited, hardware resources.
In addition, there has been a need for a base sequence information system that enables base sequence information relating to spliced pairs to be provided efficiently to a user.