The present invention generally relates to genetic motif extracting methods and apparatuses, and more particularly to a genetic motif extracting method for extracting a motif that is a conserved region or site among a plurality of genetic sequences by comparing a plurality of genetic sequence information, and to an apparatus which employs such a genetic motif extracting method.
Due to the recent progress made in genetic engineering, there is an increasing number of genetic sequence information databases representing DNA sequences or amino acid sequences. In addition, attempts to elucidate all genetic sequences of specific organisms are being made on a worldwide basis, such as the human genome project, and it is expected that the amount of genetic sequence information will rapidly increase in the future.
Among such genetic sequences, there are many genetic sequences whose sequence information is known but the functions and structures thereof are unknown. As an effective method of predicting the functions and structures of the genes from the sequence information, there is the method of retrieving motifs, that is, regularities in the distinctive features of the genetic sequences. Therefore, there is a need to realize a technique for extracting a large number of motifs from the genetic sequences whose sequences are known.
Conventionally, the motifs which specify the genetic functions in the genetic sequences and indicate the regularities of the distinctive features of the sequences have been determined based on experiments and reports in literature. A database called PROSITE is known as a database which registers such motifs.
It is known that, in general, the functionally important regions or sites of the genetic sequences are less likely to change. By utilizing this fact, it is possible to extract the motifs as conserved regions or sites of the genetic sequences, through comparison of the genetic sequences. However, a technique for extracting the motifs through comparison of a plurality of genetic sequences has not been established.
It would require considerable work to determine the motifs manually based on experiments or the like. Hence, a large amount of information effective for elucidating the genetic functions could be obtained if it were possible to extract the motifs mechanically, that is, automatically, from a comparison of the genetic sequences. However, the following problems occur if the sites of the genetic sequences are simply compared and the similarities of the sites are checked.
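The simple site-by-site comparison referred to above can be sketched as follows. This is only an illustrative sketch of the naive approach, not the method of the invention: it assumes that the sequences have already been aligned to equal length, and the function names `column_conservation` and `conserved_sites` and the threshold value are hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """Return, for each aligned site, the fraction of sequences that share
    the most common residue at that site (assumption: pre-aligned input)."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be pre-aligned to equal length")
    scores = []
    for column in zip(*aligned):
        # Count of the most frequent residue in this column.
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(aligned))
    return scores


def conserved_sites(aligned, threshold=0.9):
    """Return indices of sites whose similarity meets the (hypothetical) threshold."""
    return [i for i, s in enumerate(column_conservation(aligned)) if s >= threshold]
```

Every sequence counts equally in this sketch, regardless of which organism it comes from.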
Specifically, if the plurality of genetic sequence information which is the subject of the motif extraction is biased to specific organisms, the regularities or distinctive features that are to be extracted become biased. For example, suppose that a sequence information group contains a large amount of genetic sequence information related to advanced organisms, such as genetic sequence information of humans, monkeys and horses, but only a small amount of genetic sequence information related to less advanced organisms, and that the motifs are to be extracted from this group based on the similarities of the sites. In such a case, it cannot necessarily be concluded that the sites with high similarities are the conserved sites which did not change much during the evolution process. On the other hand, it cannot necessarily be concluded that the sites with low similarities are not conserved sites. Therefore, there is a possibility that the conserved sites which are extracted as the motifs may be erroneously determined.
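The effect of such bias can be illustrated with a small numerical sketch. The residues, organism names and grouping below are hypothetical toy data, and collapsing each group to one representative is used here only to expose the bias; it is not presented as the method of the invention.

```python
from collections import Counter


def site_similarity(residues):
    """Fraction of entries that share the most common residue at one site."""
    residues = list(residues)
    return Counter(residues).most_common(1)[0][1] / len(residues)


def group_representatives(site, groups):
    """Keep one residue per organism group: the most common residue in that group."""
    by_group = {}
    for organism, residue in site.items():
        by_group.setdefault(groups[organism], []).append(residue)
    return [Counter(r).most_common(1)[0][0] for r in by_group.values()]


# Hypothetical residues observed at a single aligned site.
SITE = {"human": "K", "monkey": "K", "horse": "K", "yeast": "R", "bacterium": "Q"}
# Hypothetical grouping of the organisms.
GROUPS = {
    "human": "mammal",
    "monkey": "mammal",
    "horse": "mammal",
    "yeast": "fungus",
    "bacterium": "bacterium",
}

raw = site_similarity(SITE.values())                             # 3 of 5 entries agree
balanced = site_similarity(group_representatives(SITE, GROUPS))  # 1 of 3 groups agree
```

In this toy data the three mammalian sequences raise the raw similarity of the site to 3/5, whereas only one of the three organism groups carries the majority residue, so the apparent similarity reflects the sampling of the sequences rather than conservation during the evolution process.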