The present invention relates to a method and a device for evaluating a decision condition with sequence motifs used for discrimination of symbol sequences such as amino acid sequences, DNA sequences and other sequences composed of symbols aligned in series.
The symbol sequences can be classified into a plurality of groups or categories so that each of the categories contains symbol sequences related to each other. The majority of the symbol sequences in one category can have one or more common symbol sequence portions or patterns. Such common patterns are called sequence motifs.
For example, proteins are sequences of amino acids. Twenty (20) symbols, that is, A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y, are used to denote the twenty amino acids, respectively, and amino acid sequences are represented as symbol sequences by use of those symbols. Proteins or amino acid sequences are classified into a plurality of superfamilies or categories of functionally related proteins. The majority of the amino acid sequences classified in one category have one or more sequence motifs which represent conserved amino acid residues. For example, a pattern "CXXCH" is known as a common pattern or a sequence motif for amino acid sequences in cytochrome c, which is a superfamily of proteins. Here, "C" and "H" represent amino acids as described above and "X" represents an arbitrary one of the twenty amino acids.
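As an illustration of how a motif such as "CXXCH" can be tested against an amino acid sequence, the following is a minimal sketch in Python. The function names `motif_to_regex` and `contains_motif` and the sample sequences are hypothetical and are used only for this illustration; the sketch assumes that "X" matches exactly one of the twenty amino acid symbols listed above.

```python
import re

# The twenty amino acid symbols listed in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def motif_to_regex(motif: str) -> re.Pattern:
    """Translate a motif such as "CXXCH" into a compiled regular expression.

    "X" stands for an arbitrary one of the twenty amino acids; every other
    symbol must be one of the twenty amino acid symbols and matches itself.
    """
    parts = []
    for symbol in motif:
        if symbol == "X":
            parts.append(f"[{AMINO_ACIDS}]")
        elif symbol in AMINO_ACIDS:
            parts.append(symbol)
        else:
            raise ValueError(f"unknown motif symbol: {symbol!r}")
    return re.compile("".join(parts))


def contains_motif(sequence: str, motif: str) -> bool:
    """Return True if the motif occurs anywhere in the sequence."""
    return motif_to_regex(motif).search(sequence) is not None


# Made-up sequences, for illustration only.
print(contains_motif("MKCAACHG", "CXXCH"))  # the portion "CAACH" matches
print(contains_motif("MKCAAHG", "CXXCH"))   # no matching portion
```

The translation to a regular expression is one convenient choice for this sketch; any substring scan that treats "X" as a wildcard would serve equally well.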
Those sequence motifs can be used as indexes for discriminating given symbol sequences and/or for predicting categories of unknown symbol sequences. That is, the sequence motifs are used in the decision condition for discriminating symbol sequences.
When those sequence motifs are used as the indexes for the decision condition, an inference rule is described as a decision predicate including the sequence motif or motifs. An example of the decision predicate is as follows: "If the given symbol sequence contains sequence motif CXXCH, the symbol sequence corresponds to cytochrome c . . . " However, it is very hard to find such a deterministic inference rule, because of the existence of noise, or uncertainty, due to the variety of biological species.
An actual amino acid sequence data bank (PIR 18.0) has 6158 sequences registered. Of these, 189 sequences contain the sequence motif CXXCH, but only 119 sequences are classified as cytochrome c. Accordingly, the sequence motif CXXCH does not by itself provide a complete decision condition for discriminating amino acid sequences as belonging to the category of cytochrome c. Another sequence motif can be used as the index in place of CXXCH or in combination with CXXCH, but it cannot give a complete decision condition either, because of the noise or uncertainty.
To overcome this difficulty, it is more appropriate to express the decision predicate as a rule with probabilities, such as: "If the given sequence contains the motif CXXCH, it corresponds to cytochrome c with probability 4/5, but otherwise with probability 1/5".
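A decision predicate with probability of this kind can be sketched in Python as follows. This is an illustrative sketch only, not the method of the invention. It assumes one reading of the rule quoted above, namely that a sequence without the motif corresponds to cytochrome c with probability 1/5. The function name `cytochrome_c_probability`, its parameters, and the sample sequences are hypothetical.

```python
import re
from fractions import Fraction

# "CXXCH" written as a regular expression; "." stands in for "X" here,
# assuming the input contains only the twenty amino acid symbols.
CXXCH = re.compile(r"C..CH")


def cytochrome_c_probability(
    sequence: str,
    p_with_motif: Fraction = Fraction(4, 5),     # value quoted in the text
    p_without_motif: Fraction = Fraction(1, 5),  # assumed reading of "otherwise"
) -> Fraction:
    """Return the probability that the sequence corresponds to cytochrome c.

    The rule is stochastic rather than deterministic: the presence of the
    motif changes the probability of the category, but does not decide it.
    """
    if CXXCH.search(sequence):
        return p_with_motif
    return p_without_motif


# Made-up sequences, for illustration only.
print(cytochrome_c_probability("MKCAACHG"))  # contains the motif
print(cytochrome_c_probability("MKAAAG"))    # does not contain the motif
```

Exact fractions are used so that the probabilities appear in the same form as in the rule above.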
In other words, sequence motifs are not present in all of the symbol sequences classified in one category, but are present with certain probabilities.
Distance information between two sequence motifs is also known to be used as an item of the decision condition.
Therefore, a plurality of decision conditions can be made for discriminating a given symbol sequence as belonging to one of the categories, depending on which sequence motifs and which distance information are used.
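To illustrate how distance information between two motifs might enter a decision condition, the following Python sketch measures the distance from an occurrence of one motif to a later occurrence of another. The text does not define how the distance is measured; the sketch assumes it is the difference between the start positions of the two occurrences. The second motif "WP", the function names, and the sample sequence are hypothetical.

```python
import re


def motif_positions(sequence: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of the pattern.

    A lookahead is used so that overlapping occurrences are also found.
    """
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence)]


def motif_distances(sequence: str, first: str, second: str) -> list[int]:
    """Distances from each occurrence of `first` to each later occurrence
    of `second`, measured start to start (an assumed definition)."""
    return [
        j - i
        for i in motif_positions(sequence, first)
        for j in motif_positions(sequence, second)
        if j > i
    ]


def satisfies_condition(
    sequence: str, first: str, second: str, min_distance: int, max_distance: int
) -> bool:
    """Decision condition: both motifs occur, and at least one pair of
    occurrences lies within the given distance range."""
    return any(
        min_distance <= d <= max_distance
        for d in motif_distances(sequence, first, second)
    )


# Made-up sequence: "CAACH" starts at index 0 and "WP" starts at index 9.
seq = "CAACHGGGGWP"
print(motif_distances(seq, "C..CH", "WP"))             # [9]
print(satisfies_condition(seq, "C..CH", "WP", 5, 10))  # True
print(satisfies_condition(seq, "C..CH", "WP", 1, 5))   # False
```

Each choice of motif pair and distance range yields a different decision condition, which is why a plurality of candidate conditions arises.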
It is therefore desirable to evaluate those decision conditions in order to select the optimal one among them.
A further problem is to select the optimal motifs to be used in the decision predicates.