1. Field of the Invention
The present invention relates to a method for clustering and assembling a large number of nucleic acid base sequences at a high speed.
2. Description of the Related Art
The completion of the base sequence determination of the human genome was announced by international joint projects and a U.S. venture company in June of 2000. With improvements in DNA sequence determination technology, such as the widespread use of DNA sequencers utilizing four colors of fluorescent dyes or capillaries, complete genome sequences of several tens of species of microorganisms, including E. coli and S. cerevisiae, and of multicellular organisms such as C. elegans and D. melanogaster have been determined, and draft sequences of the human genome have also become available. In addition, genome projects on various kinds of organisms such as mouse and rice plant are in progress.
While the analysis of genome sequences proceeds, mRNA is also analyzed in order to study the genes being expressed. mRNA is a type of RNA which is produced from genomic DNA upon gene expression and is essential in the course of functional expression of the gene. Because mRNA is easily degraded, it is frequently analyzed in the form of cDNA, which is more stable than mRNA and can be easily obtained from mRNA through reverse transcription. A sequence obtained by single-pass sequence analysis of cDNA is referred to as an Expressed Sequence Tag (EST). ESTs can be utilized for various applications, one of which is to obtain mRNA sequences.
FIG. 13 schematically illustrates the clustering and assembling processes of ESTs derived from mRNAs.
When mRNAs are converted into cDNAs 1301, it is difficult to obtain full-length cDNAs including the 5′-ends; thus the resulting ESTs 1302 based on the cDNAs are sequences in which the positions of the 5′-ends usually vary, as shown in FIG. 13. When ESTs derived from a cDNA library prepared from all RNAs of a cell or tissue are analyzed, only a set 1303 of ESTs can be obtained. Therefore, it is impossible to know in advance which mRNA has contributed to a given EST. In the sequence set 1303 made by collecting the ESTs 1302, sequences are combined with each other (assembled) based on similar parts 1305 thereof and divided (clustered) into smaller sets, as symbolically indicated by arrows 1304. This process allows for identification of ESTs obtained from the same mRNA, and further yields sequences 1306 in which the mRNA sequences are partially reconstructed.
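The clustering step described above may be illustrated by the following minimal Python sketch. It assumes that the pairs of sequences sharing a similar part have already been found by some other means and groups the sequences into clusters with a union-find structure. The function name cluster_by_overlap and the input format are hypothetical and chosen only for illustration. The sketch is a generic technique and does not represent the method of the present invention.

```python
def cluster_by_overlap(n, overlapping_pairs):
    """Group sequence indices 0..n-1 into clusters.

    overlapping_pairs is an assumed input: (i, j) index pairs of sequences
    already known to share a similar part. Sequences connected directly or
    transitively through such pairs end up in the same cluster.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlapping_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Five sequences; 0-1 and 1-2 overlap, and 3-4 overlap.
print(cluster_by_overlap(5, [(0, 1), (1, 2), (3, 4)]))  # [[0, 1, 2], [3, 4]]
```

In this toy example, sequences 0, 1, and 2 would be taken as derived from one mRNA and sequences 3 and 4 from another. Assembling the members of each cluster into a longer sequence is a separate step that is not shown.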
As for humans, it is said that more than a hundred thousand mRNAs exist, corresponding to the number of proteins, so that it is ideal to obtain assemblies corresponding to the respective mRNA sequences by clustering and assembling the input sequence data including ESTs. Presently, about 3.9 million unprocessed human-derived EST sequences and about 1.5 million human sequences, including ESTs, partitioned into a set of gene-oriented clusters are stored in a database managed by a U.S. public institution. As the focus of study shifts to gene function analysis with the progress of genome sequence determination, the number of mRNA-derived sequences to be analyzed is expected to increase further.
The assembling technology is also essential for genome sequence determination, which primarily uses a shotgun method. In sequence determination by the shotgun method, a long DNA is broken into many smaller fragments, which are cloned; the sequence of each fragment is determined; and the sequences are assembled to determine the entire sequence. For example, the genome sequence of E. coli has about 4639K bases, and its determination by the shotgun method with the usually required redundancy of 10 requires assembling 4.639×10⁶×10/500 = 9.278×10⁴ sequences, considering that the length of a sequence obtained through a single electrophoresis run on a DNA sequencer is about 500 bases. On the other hand, the genome sizes of higher organisms such as C. elegans, mice, and humans are greater than that of E. coli by two or three orders of magnitude, so that the number of sequences required for genome determination is estimated to reach ten million to a hundred million. As the determination of genome sequences of various organisms will continue in the future, the number of sequences subjected to assembling is expected to increase further.
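The estimate of the number of sequences to be assembled can be reproduced with the following short Python sketch, which simply evaluates genome length × redundancy / read length. The function name shotgun_read_count is hypothetical and is used only for illustration. The input values are the ones given above for E. coli.

```python
def shotgun_read_count(genome_length, redundancy, read_length):
    """Estimate the number of reads to assemble in a shotgun project.

    Total bases to be sequenced = genome_length * redundancy; each read
    contributes about read_length bases.
    """
    return round(genome_length * redundancy / read_length)


# E. coli: about 4,639,000 bases, redundancy of 10, about 500 bases per read.
print(shotgun_read_count(4_639_000, 10, 500))  # 92780
```

A genome two to three orders of magnitude larger, sequenced under the same assumptions, gives a count of roughly ten million to a hundred million reads.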
With such a huge number of nucleic acid base sequences, it is difficult in terms of computation time to study the interrelation among the respective sequences and to cluster or assemble them. A primary problem in clustering and assembling sequences is how to search efficiently for overlaps between sequences. If the search for overlaps is simply conducted on all pairs of sequences, the number of combinations to be searched is on the order of the square of the number of sequences, so that an increase in the number of sequences leads to a substantial increase in the processing time. It is therefore desirable that the order of the entire clustering and assembling processing be far lower than the square of the number of sequences.
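The quadratic cost of the simple all-pairs search can be seen in the following minimal Python sketch. It assumes, for illustration only, that an overlap is an exact match between a suffix of one sequence and a prefix of another, and it ignores sequencing errors and reverse complements. The function names suffix_prefix_overlap and naive_overlaps and the min_len parameter are hypothetical.

```python
from itertools import combinations


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 when no exact overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def naive_overlaps(seqs, min_len=3):
    """Examine every unordered pair of sequences in both orientations.

    Returns (overlaps, pairs_examined), where overlaps holds tuples
    (i, j, k) meaning a k-base suffix of seqs[i] equals a prefix of seqs[j].
    """
    found = []
    pairs_examined = 0
    for i, j in combinations(range(len(seqs)), 2):
        pairs_examined += 1
        for x, y in ((i, j), (j, i)):
            k = suffix_prefix_overlap(seqs[x], seqs[y], min_len)
            if k:
                found.append((x, y, k))
    return found, pairs_examined


seqs = ["ACGTAC", "TACGGA", "GGATTT", "CCCCCC"]
overlaps, pairs_examined = naive_overlaps(seqs)
print(overlaps)        # [(0, 1, 3), (1, 2, 3)]
print(pairs_examined)  # 6
```

For n sequences the loop examines n(n−1)/2 pairs, so four sequences give 6 pairs, while one hundred thousand sequences would give about 5×10⁹ pairs.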
One approach to searching efficiently for overlaps for clustering and assembling is the method described in Huang, X. and Madan, A., Genome Research, 9:868–877, 1999. However, the number of overlaps required to be processed still reaches the order of the square of the number of sequences, so that the entire clustering and assembling processing is also on the order of the square of the number of sequences. The number of sequences subjected to the clustering and assembling processes has been growing continuously, and it can be expected to continue to grow.
In view of such problems in the prior art, an object of the present invention is to provide a method and a device for clustering and assembling sequences with a computational complexity of an order lower than the square of the number of input sequences, and thereby for clustering and assembling a large number of nucleic acid base sequences at a high speed.