In general, characteristic nucleic base sequences called promoters exist in the vicinity of the gene coding region of a deoxyribonucleic acid (DNA) segment (hereinafter simply referred to as a DNA segment). In a DNA sequence, the promoter is a control site for gene transcription and has a specific pattern. Therefore, it is very important to decide whether or not a promoter is included in a DNA segment.
FIG. 41 shows an example of a state where an RNA polymerase (ribonucleic acid polymerase which is a kind of enzyme) searches for an Escherichia coli promoter in an Escherichia coli nucleic base sequence. Only a portion of the nucleic base sequences in the long DNA sequence is transcribed to an RNA sequence. Then, the RNA sequence is translated into an amino acid chain so that a protein is synthesized.
When the RNA polymerase meets a DNA sequence, it binds weakly and slides along the DNA. When the RNA polymerase meets the promoter, it binds strongly and starts the transcription of the DNA sequence.
By performing a biological experiment in a test tube or an X-ray analysis, it can be decided whether or not the promoter is included in the DNA segment.
However, these methods increase testing time and cost. In particular, the X-ray analysis requires safety measures.
Therefore, instead of these methods, a method has been proposed in which DNA segments, each constructed with the four nucleotide symbols A, T, G, and C, are prepared as discrete-valued data in a computer and a calculation process is performed to decide whether promoters exist. Such a method is very important for processing a large number of DNA segments at high speed and low cost.
Deciding the existence of a promoter in a DNA segment constructed with a sequence of the nucleotide symbols A, T, G, and C may appear to be a good approach, but it is not simple in practice. This is because promoter patterns vary widely. For example, some of the nucleotides constituting the promoters may differ from one promoter to another, or the overall lengths and positions of the promoters may differ. Therefore, conventionally, methods that transform the discrete symbol sequence of A, T, G, and C into continuous values and analyze the resulting patterns have been employed.
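As a minimal illustrative sketch of such a transformation, the following Python code maps each nucleotide symbol to a four-element numeric vector (a so-called one-hot encoding). This particular encoding, the function name one_hot_encode, and the symbol order A, T, G, C are assumptions made here for illustration only; the cited documents may use different transformations.

```python
# Hypothetical illustration: one-hot encoding of a DNA segment.
# The symbol order and the function name are assumptions, not taken
# from the cited documents.
NUCLEOTIDES = "ATGC"


def one_hot_encode(segment):
    """Return a flat list of floats, four values per nucleotide symbol."""
    index = {symbol: i for i, symbol in enumerate(NUCLEOTIDES)}
    encoded = []
    for symbol in segment.upper():
        if symbol not in index:
            raise ValueError("unexpected symbol in DNA segment: %r" % symbol)
        row = [0.0] * len(NUCLEOTIDES)
        row[index[symbol]] = 1.0  # mark the position of this symbol
        encoded.extend(row)
    return encoded


if __name__ == "__main__":
    # Each symbol of the segment becomes four numeric values.
    print(one_hot_encode("TATAAT"))
```

With such a representation, a segment of n symbols becomes a numeric vector of length 4n, which can then be supplied to a numerical method such as a neural network.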
For example, there is a method using neural networks (see Non-Patent Document 1) and a method using a combination of neural networks and the expectation-maximization algorithm (EM algorithm) (see Non-Patent Document 2). These methods are also described in a well-known monograph on bioinformatics (see Non-Patent Document 3).
Besides the methods of the above non-patent documents, there is a class of chemical classification apparatuses that classify, with high accuracy, information indicating a change in the amounts of plural types of chemicals (including genes and by-products of genes). Such an apparatus employs principal component analysis (PCA), and often independent component analysis (ICA) is further used (see Patent Document 1). Unlike the present invention, however, these apparatuses are not applicable to the recognition or the prediction of discrete symbol patterns such as promoters.
A document about a homology score used for a process according to the second embodiment of the present invention is provided as a reference (see Non-Patent Document 4).
Patent Document 1: Japanese Patent Application Publication No. 2003-141102 (claim 1, Abstract)
Non-Patent Document 1: I. Mahadevan and I. Ghosh, “Analysis Of E. Coli Promoter Structures Using Neural Networks”, Nucleic Acids Research, 1994, vol. 22, p. 2158-2165
Non-Patent Document 2: Q. Ma, T. L. Wang, D. Shasha, and C. H. Wu, “DNA Sequence Classification Via An Expectation Maximization Algorithm And Neural Networks: A Case Study”, IEEE Transactions on Systems, Man and Cybernetics, Part-C: Applications and Reviews, 2001, vol. 31, p. 468-475
Non-Patent Document 3: D. W. Mount, “Bioinformatics: Sequence And Genome Analysis”, Cold Spring Harbor Laboratory Press, 2001 (“Bioinformatics” translated by Yasushi Okazaki and Hidemasa Bono, Medical Science International, 2002)
Non-Patent Document 4: Martin E. Mulligan, Diane K. Hawley, Robert Entriken, William R. McClure, “Escherichia Coli Promoter Sequences Predict In Vitro RNA Polymerase Selectivity”, Nucleic Acids Research, 1984, vol. 12, p. 789-800