1. Field of the Invention
The present invention relates to a method for comparing DNA base sequences and a method for searching for DNA base sequences. In particular, it relates to a method for high-sensitivity detection of similarities between DNA base sequences and a method for estimation of an amino acid sequence coded for by a DNA base sequence.
2. Description of the Related Art
In recent years, the DNA base sequences of various organisms have increasingly been determined, and the functions of the proteins coded for by those sequences have been analyzed. A DNA base sequence is a sequence of four kinds of bases, A, C, G and T, and portions of the DNA base sequence code for biofunctional proteins. Of these proteins, those having an important function can be utilized, for example, for the design and development of drugs, so a technique for accurately estimating the function of the protein coded for by a DNA base sequence is desired. In general, the determination of a DNA base sequence is technically easier than experimental protein sequencing.
The function of a protein coded for by a newly determined DNA base sequence is estimated as follows. The DNA base sequence is translated into an amino acid sequence (which yields the protein sequence) by using the well-known codon table, in which the starting point of translation into amino acids, the terminating point of translation and the kinds of amino acids are each prescribed in terms of a triplet nucleotide unit (a codon unit). The resulting protein sequence is then compared with data on a protein having a known function, to judge whether the proteins are similar or not.
In a DNA base sequence, the exon region coding for protein information is a region to be translated into amino acids. The codons are unequivocally translated into the amino acids. When the direction of translation of the DNA base sequence and the translation starting point are known, the DNA base sequence can be translated into an amino acid sequence, i.e., a protein, by picking out triplets of successive nucleotides from the DNA base sequence in succession. However, if there is an error due to a nucleotide insertion or deletion in the DNA base sequence, the reading frame of the exon region is shifted. Since the DNA base sequence is translated into amino acids in codon units, the portion after a nucleotide insertion or deletion is translated into completely different amino acids.
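The effect of a single inserted nucleotide on the translated amino acids can be illustrated with a minimal Python sketch. The example sequences and the helper name `translate` are hypothetical and chosen only for illustration; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate successive triplets starting at the first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

original = "ATGGCCAAGTTT"      # hypothetical coding sequence
inserted = "ATGGGCCAAGTTT"     # same sequence with one extra G after the first codon

print(translate(original))     # MAKF
print(translate(inserted))     # MGQV
```

Only the first amino acid survives the insertion; every codon after the inserted base is read in a shifted frame.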
For comparing two DNA base sequences by translating them into amino acid sequences, respectively, and comparing these translated amino acid sequences, the translated amino acid sequences should be determined from the respective DNA base sequences.
FIG. 1 is a diagram illustrating 6 kinds of reading frames in a DNA base sequence in the translation of the DNA base sequence into an amino acid sequence [(first prior art): for example, reference 1: Biotechnology textbook series 11 "Introduction of Computer in Biotechnology" written by Haruki Nakamura and Kenta Nakai, pp. 66-67 (1995), CORONA PUBLISHING CO., LTD., Tokyo, Japan].
The 6 kinds of the reading frames are as follows:
Frame (1): a frame according to which a DNA base sequence is translated into an amino acid sequence as codon units from the 5′-terminal of the DNA base sequence.
Frame (2): a frame according to which the DNA base sequence is translated into an amino acid sequence as codon units while shifting the starting position of each codon by one base from that in frame (1).
Frame (3): a frame according to which the DNA base sequence is translated into an amino acid sequence as codon units while shifting the starting position of each codon by two bases from that in frame (1).
Frame (4): a frame according to which the translation of a sequence complementary to the DNA base sequence into an amino acid sequence as codon units is initiated from the 5′-terminal of the complementary sequence.
Frame (5): a frame according to which the complementary sequence is translated into an amino acid sequence as codon units while shifting the translation starting position by one base from that in frame (4).
Frame (6): a frame according to which the complementary sequence is translated into an amino acid sequence as codon units while shifting the translation starting position by two bases from that in frame (4).
From frame (1) to frame (3), the translation starting position is shifted base by base from the 5′-terminal. From frame (4) to frame (6), the translation starting position is shifted base by base from the 5′-terminal of the sequence complementary to the original DNA base sequence (the 3′-terminal of the original DNA base sequence). Therefore, there are the six kinds of reading frames (1) to (6). A DNA base sequence is translated into an amino acid sequence by employing each of frames (1) to (6). Amino acid sequences translated from two DNA base sequences, respectively, by employing the same frame are compared. Thus, 6 kinds, in all, of amino acid sequences translated from one of the DNA base sequences are compared with those translated from the other DNA base sequence.
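The six reading frames described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular program; the function names `reverse_complement`, `translate` and `six_frames` and the example sequence are assumptions introduced here, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complementary strand, written from its own 5'-terminal."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna, offset=0):
    """Translate codon by codon, starting `offset` bases from the 5'-terminal."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(offset, len(dna) - 2, 3))

def six_frames(dna):
    """Frames (1)-(3) on the given strand, frames (4)-(6) on its complement."""
    rc = reverse_complement(dna)
    return ([translate(dna, k) for k in range(3)]
            + [translate(rc, k) for k in range(3)])

frames = six_frames("ATGGCCTAA")   # hypothetical example sequence
for number, protein in enumerate(frames, start=1):
    print(f"frame ({number}): {protein}")
```

Trailing bases that do not fill a complete codon are simply ignored in this sketch.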
As a typical program for searching for similar sequences, BLAST, developed by Altschul et al. of NCBI, a branch of the U.S. NIH, is widely known, and its source program has been disclosed [(second prior art): see, for example, reference 1, pages 141 to 143]. The BLAST family includes BLASTN for comparing DNA base sequences; BLASTP for comparing amino acid sequences; BLASTX for searching an amino acid sequence data base for each of the 6 kinds of amino acid sequences mechanically translated from a DNA base sequence according to each of the above-mentioned 6 kinds of frames; and TBLASTX for mechanically translating each of a query DNA base sequence as a first DNA base sequence and a DNA base sequence read out of a DNA base sequence data base (a target DNA base sequence) as a second DNA base sequence according to each of the above-mentioned 6 kinds of frames, and comparing the 36 combinations of the 6 kinds of amino acid sequences translated from the first DNA base sequence and the 6 kinds of amino acid sequences translated from the second DNA base sequence. In the BLAST family, high-speed pattern matching of a base sequence having a definite length in a query DNA base sequence against a target DNA base sequence is carried out first, and a region similar to the query DNA base sequence is then detected on the basis of the position at which the base sequence of definite length is found in the target DNA base sequence.
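The first step described above, fast matching of fixed-length words, can be illustrated with a simplified Python sketch. It is not the BLAST source code; the function name `seed_hits`, the word length of 4 and the example sequences are assumptions made only for illustration.

```python
from collections import defaultdict

def seed_hits(query, target, word=4):
    """Return (query_pos, target_pos) pairs where a fixed-length word matches exactly."""
    # Index every word of the target once so that each query word is a dictionary lookup.
    index = defaultdict(list)
    for j in range(len(target) - word + 1):
        index[target[j:j + word]].append(j)
    return [(i, j)
            for i in range(len(query) - word + 1)
            for j in index.get(query[i:i + word], [])]

# Hypothetical sequences: the word ACGT occurs at query position 0 and target position 2.
print(seed_hits("ACGTAC", "TTACGTTT"))   # [(0, 2)]
```

Each hit marks a position around which a similar region would then be examined. Because only exact words are matched, a region interrupted by insertions or deletions may yield no hit at all.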
In the Smith-Waterman method, each base of a query DNA base sequence is compared with each base of a target DNA base sequence, a score (a similarity) suitable for the combination of the two bases is given, the scores (similarities) thus given are accumulated, and there is sought a path (an alignment) in which the accumulated score (similarity) becomes maximum [(third prior art): for example, reference 2: "Identification of Common Molecular Subsequences", J. Mol. Biol., 147 (1981), pp. 195-197].
In the third prior art, the combinations of two bases of two DNA base sequences, respectively, are compared by a dynamic programming method, and scores between the two DNA base sequences are determined. When a DNA base sequence similar to a specific noted DNA base sequence (hereinafter referred to as "query DNA base sequence" or "first DNA base sequence") is searched for in a DNA base sequence data base, a matrix is formed by aligning the bases of the query DNA base sequence (number of bases: M) in regular order from the 5′-terminal along a first axis (for example, x-axis) and the bases of a DNA base sequence (number of bases: N) read out of the DNA base sequence data base (hereinafter referred to as "target DNA base sequence" or "second DNA base sequence") in regular order from the 5′-terminal along a second axis (for example, y-axis) (in the present specification, such a matrix is hereinafter referred to as "score matrix") (FIG. 2).
FIG. 2 is a diagram illustrating accumulation paths of scores for comparing the first and second DNA base sequences. Each combination of the two bases of the first and second DNA base sequences, respectively, is expressed as the position of a score matrix element (i, j) (i=1, 2, ..., M; j=1, 2, ..., N).
In the dynamic programming method, shift paths (search paths) in three directions, the vertical direction, the horizontal direction and the diagonal direction (the directions a, b and c, respectively, shown in FIG. 2) to a score matrix element (i, j) are considered, and the position of (i, j) is shifted toward a score matrix element (M, N) at the lower right corner from the score matrix element (1, 1) at the upper left corner shown in FIG. 2, by changing the number i from 1 to M and the number j from 1 to N, whereby there is determined the optimum path (the optimum alignment) which shows the optimum combinations for similarities of the bases of the first DNA base sequence and the bases of the second DNA base sequence.
The value H(i, j) of a score matrix element (i, j) indicates an accumulated similarity (score) between a base sequence from the first base to the i-th base in the first DNA base sequence and a base sequence from the first base to the j-th base in the second DNA base sequence. For the shift paths in the directions a, b and c shown in FIG. 2, the accumulated similarities (scores), Ha(i, j), Hb(i, j) and Hc(i, j), respectively, are defined by the (equation 1), (equation 2) and (equation 3) shown below, by using a score s(i, j) indicating the similarity between the i-th base of the first DNA base sequence and the j-th base of the second DNA base sequence, a gap penalty score p and accumulated similarities (scores) H(i−1, j−1), H(i−1, j) and H(i, j−1) at score matrix elements (i−1, j−1), (i−1, j) and (i, j−1), respectively, at the original points before shift to the point (i, j). The maximum among Ha(i, j), Hb(i, j) and Hc(i, j) [(equation 4)] is selected as H(i, j). The above-mentioned score s(i, j) can be determined using a previously stored score table. For example, a score of 4 is given to a combination of the same bases, a score of −8n−4 is given when the number of inserted or deleted nucleotides is n, and a score of −3 is given to a combination of two different bases.
Ha(i, j) = H(i−1, j−1) + s(i, j)    (equation 1)
Hb(i, j) = H(i, j−1) + p    (equation 2)
Hc(i, j) = H(i−1, j) + p    (equation 3)
H(i, j) = max{Ha(i, j), Hb(i, j), Hc(i, j)}    (equation 4)
The gap penalty score p added in the shift path b corresponds to the presence of a nucleotide deletion after the i-th base of the first DNA base sequence, and the gap penalty score p added in the shift path c corresponds to the presence of a nucleotide deletion after the j-th base of the second DNA base sequence.
The first and second DNA base sequences are compared by varying the number i from 1 to M and the number j from 1 to N in shift paths from the score matrix element (1, 1) to the score matrix element (M, N), and scores or gap penalty scores are added up in each shift path, whereby there is determined H*=H(M, N), the maximum accumulated similarity (score) between the whole first DNA base sequence and the whole second DNA base sequence. Consequently, it is possible to determine an alignment which gives the greatest similarity between the first and second DNA base sequences, namely, the optimum alignment showing the optimum combinations of the bases of the first DNA base sequence and the bases of the second DNA base sequence.
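The accumulation defined by (equation 1) to (equation 4) can be written as a short Python sketch. The scores 4 and −3 are the example values given above, and p = −12 corresponds to −8n−4 with n = 1. The boundary values H(i, 0) = i·p and H(0, j) = j·p are an assumption of this sketch, since the text does not state them, and the function name `accumulated_score` is hypothetical.

```python
def accumulated_score(first, second, match=4, mismatch=-3, p=-12):
    """Return H(M, N) computed with equations 1-4 (boundary values are assumed)."""
    M, N = len(first), len(second)
    H = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary: a leading run of gaps costs p per skipped base.
    for i in range(1, M + 1):
        H[i][0] = i * p
    for j in range(1, N + 1):
        H[0][j] = j * p
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if first[i - 1] == second[j - 1] else mismatch
            Ha = H[i - 1][j - 1] + s      # equation 1
            Hb = H[i][j - 1] + p          # equation 2
            Hc = H[i - 1][j] + p          # equation 3
            H[i][j] = max(Ha, Hb, Hc)     # equation 4
    return H[M][N]

print(accumulated_score("ACGT", "ACGT"))   # 16: four identical bases
print(accumulated_score("ACGT", "ACCT"))   # 9: three matches and one mismatch
print(accumulated_score("ACGT", "ACT"))    # 0: three matches and one gap
```

Filling the matrix takes on the order of M×N steps, and storing at each element which of the three paths was selected would allow the optimum alignment to be traced back from (M, N).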
The third prior art is applicable not only to the investigation of similarities between two DNA base sequences but also to the investigation of similarities between two amino acid sequences.
The above-mentioned first prior art involves the following problem. When a nucleotide insertion or deletion is present in a DNA base sequence, a frame shift occurs at the position of the nucleotide insertion or deletion, and an amino acid sequence coded for by the portion of the base sequence after the frame shift position does not have any similarity which would be given if there were no nucleotide insertion or deletion. Therefore, an amino acid sequence cannot be found which would be obtainable if there were no nucleotide insertion or deletion. Thus, a miss or omission occurs in the search.
Even if an amino acid sequence very similar to an amino acid sequence obtained by translation using, for example, the frame (1) among the 6 kinds of the frames in a DNA base sequence is present in an amino acid sequence translated from another DNA base sequence, the following problem is caused when a nucleotide insertion or deletion is present in the DNA base sequence: the position of the frame is changed to that of the frame (2) or the frame (3) in the portion of the base sequence after the position of the nucleotide insertion or deletion. In the prior art, there has been disclosed neither a method for comparison of DNA base sequences nor a method for search for DNA base sequences, which has been developed in view of a change of reading frame caused by a nucleotide insertion or deletion in the DNA base sequence.
The BLAST family including TBLASTX in the above-mentioned second prior art is disadvantageous in that a miss or omission occurs in the search because, in order to assure high-speed calculation, gaps due to nucleotide insertions or deletions in a DNA base sequence or amino acid insertions or deletions in an amino acid sequence are not considered.
The above-mentioned third prior art is an accurate search method but is disadvantageous in that it requires a long period of time because each base of a DNA base sequence is compared with each base of another DNA base sequence. When the third prior art is combined with the first prior art, namely, when each of two DNA base sequences, a query DNA base sequence and a target DNA base sequence, is translated into an amino acid sequence and the translated amino acid sequences are compared, a longer search time is required because it is necessary to compare 36 combinations of 6 kinds of amino acid sequences translated from the first DNA base sequence according to the 6 kinds of the frames, respectively, explained in the first prior art and 6 kinds of amino acid sequences translated from the second DNA base sequence according to the 6 kinds of the frames, respectively.
Moreover, when the Smith-Waterman method as the third prior art is combined with the first prior art, the insertion or deletion of amino acids or the insertion or deletion of nucleotides in codon units in a DNA base sequence can be considered, but the insertion or deletion of nucleotides in a number other than multiples of 3 (i.e. the number of nucleotides constituting a codon unit) in a DNA base sequence cannot be considered. Therefore, a change of the reading frame position cannot be considered.
In the prior arts, there is not considered the prevention of the production of erroneous results due to nucleotide insertions or deletions in a DNA base sequence. That is, it is not considered that the DNA base sequence is translated into an amino acid sequence in view of the presence of the nucleotide insertions or deletions.
Japanese Patent Application No. 7-265157 [reference 3: application date in Japan: Oct. 13, 1995 (JP-A-09-105748 (laid-open date in Japan: Apr. 22, 1997))] which is not a known reference discloses a method for comparison of DNA base sequences which comprises dividing each of first and second DNA base sequences into triplets of successive nucleotides, to form first and second, respectively, intermediate DNA base sequences, translating each of the first and second intermediate DNA base sequences into amino acids to form first and second, respectively, translated amino acid sequences, determining a first similarity between the first DNA base sequence and the first intermediate DNA base sequence, a second similarity between the second DNA base sequence and the second intermediate DNA base sequence, and a third similarity between the first translated amino acid sequence and the second translated amino acid sequence, and choosing the first and second intermediate DNA base sequences and the first and second translated amino acid sequences so that a parameter obtained from the first, second and third similarities by the use of a predetermined function may be maximum.
Japanese Patent Application No. 8-167770 (reference 4: application date in Japan: Jun. 27, 1996) which is not a known reference discloses a method for comparison of sequences which comprises translating a query DNA base sequence into amino acids in view of nucleotide insertions or deletions, comparing the resulting translated amino acid sequence with a target amino acid sequence read out of an amino acid data base, according to the Smith-Waterman method, determining the score (similarity) between the i-th amino acid of the translated amino acid sequence and the j-th amino acid of the target amino acid sequence in view of 7 kinds of paths, and thereby aligning the translated amino acid sequence with the target amino acid sequence.
The reference 3, however, does not disclose a technique concerning a specific example of path in calculation according to the dynamic programming method. The reference 4 discloses a method comprising picking out successive codons each having a starting position one or two bases after that of the preceding codon, in the translation of a query DNA base sequence into an amino acid sequence (which corresponds to the first translation method employed in the present invention), but does not disclose the second and third translation methods employed in the present invention which are explained hereinafter in detail. The reference 4 does not disclose a technique for comparing an amino acid sequence translated from a query DNA base sequence with an amino acid sequence translated from a DNA base sequence read out of a DNA base sequence data base.
An object of the present invention is to provide a method for comparison of DNA base sequences which hardly causes a miss or omission in search and comprises translating each of a query DNA base sequence and a DNA base sequence read out of a DNA base sequence data base (a target DNA base sequence) into an amino acid sequence, and thereby comparing the two DNA base sequences through the translated amino acid sequences, in particular, a method for high-sensitivity detection of similarities between DNA base sequences and a method for estimation of an amino acid sequence coded for by a query DNA base sequence.
In the method for comparison of DNA base sequences of the present invention, when similarities between first and second DNA base sequences are investigated, each DNA base sequence is first divided into triplets of successive nucleotides which may involve a nucleotide insertion or deletion. Each of the triplets is translated into an amino acid according to the codon table. Similarities between each amino acid of the thus obtained first translated amino acid sequence and each amino acid of the thus obtained second translated amino acid sequence are accumulated in view of amino acid insertions or deletions in each amino acid sequence to obtain an accumulated score (similarity). There are determined combinations of amino acids of the first translated amino acid sequence and those of the second translated amino acid sequence which give the maximum accumulated similarity (the maximum accumulated score). Thus, there are attained the maximum accumulated score, the alignment of the first and second translated amino acid sequences, and the alignment of the DNA base sequence corresponding to the first translated amino acid sequence with the DNA base sequence corresponding to the second translated amino acid sequence. A specific noted DNA base sequence (a query DNA base sequence) is used as the above first DNA base sequence, and a known DNA base sequence read out of any of various DNA base sequence data bases (a target DNA base sequence) is used as the above second DNA base sequence.
As a method for translating each DNA base sequence into an amino acid sequence which is adopted in the method for comparison of DNA base sequences of the present invention, the following first, second and third translation methods are employed in combination.
In the first translation method, the DNA base sequence is translated into an amino acid sequence according to a predetermined translation rule by codon table while shifting a reading frame for the DNA base sequence at every triplet of successive nucleotides base by base from the end of the DNA base sequence.
In the second translation method, a reading frame for the DNA base sequence is shifted at every quartet of successive nucleotides base by base from the end of the DNA base sequence, the second of the four nucleotides of each quartet is taken as an inserted nucleotide, and the DNA base sequence is translated into an amino acid sequence according to a predetermined translation rule by codon table by using the remaining three of the four nucleotides.
In the third translation method, a reading frame for the DNA base sequence is shifted at every quartet of successive nucleotides base by base from the end of the DNA base sequence, the third of the four nucleotides of each quartet is taken as an inserted nucleotide, and the DNA base sequence is translated into an amino acid sequence according to a predetermined translation rule by codon table by using the remaining three of the four nucleotides.
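One possible reading of the three translation methods is sketched below in Python. It assumes that the reading frame advances one base at a time, so that each starting position yields one amino acid, and that the codon table is the standard genetic code. The function names and the example sequence are hypothetical, and the sketch is an interpretation of the description rather than the definitive procedure.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def first_method(dna):
    """Translate the triplet starting at every base position."""
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(len(dna) - 2))

def second_method(dna):
    """Take every quartet, treat its second base as inserted, translate bases 1, 3, 4."""
    return "".join(CODON_TABLE[dna[p] + dna[p + 2:p + 4]]
                   for p in range(len(dna) - 3))

def third_method(dna):
    """Take every quartet, treat its third base as inserted, translate bases 1, 2, 4."""
    return "".join(CODON_TABLE[dna[p:p + 2] + dna[p + 3]]
                   for p in range(len(dna) - 3))

dna = "ATGGCC"                 # hypothetical example sequence
print(first_method(dna))       # MWGA
print(second_method(dna))      # RCA
print(third_method(dna))       # MCG
```

Under this reading, the amino acid at position p of each output is the one whose codon begins at base p, which is why the comparison that follows can step through a translated sequence in units of three or four positions.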
In the method for comparison of DNA base sequences of the present invention, a dynamic programming method is employed for calculating the accumulated score in the comparison of the first and second amino acid sequences translated from the first and second, respectively, DNA base sequences. In the calculation according to the dynamic programming method, when there are accumulated scores (similarities) between the i-th amino acid of the first translated amino acid sequence and the j-th amino acid of the second translated amino acid sequence which is represented by a score matrix element (i, j), there are considered seven paths from score matrix elements (i−3, j−3), (i, j−3k), (i−3k, j), (i−3n+1, j−3n), (i−3n, j−3n+1), (i−3m, j−3m−1) and (i−3m−1, j−3m), respectively, wherein k is an integer in a range of k≥1, m is an integer in a range of m≥1, and n is an integer in a range of n≥2. When k=1, m=1 and n=2, there are considered paths from score matrix elements (i−3, j−3), (i, j−3), (i−3, j), (i−5, j−6), (i−6, j−5), (i−3, j−4) and (i−4, j−3), respectively. The elements in the parentheses are positive numbers. The symbol i is an integer in a range of i≤M wherein M is the number of amino acids in the first translated amino acid sequence, and the symbol j is an integer in a range of j≤N wherein N is the number of amino acids in the second translated amino acid sequence.
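The seven originating score matrix elements can be enumerated with a small Python sketch, which also checks the case k=1, m=1 and n=2 given above. The function name `predecessors` is hypothetical, and the sketch fixes a single value for each of k, m and n, whereas the description allows each to range over integers. Candidates whose coordinates are not positive are discarded, in line with the statement that the elements in the parentheses are positive numbers.

```python
def predecessors(i, j, k=1, m=1, n=2):
    """List the originating elements of the seven paths into element (i, j)."""
    candidates = [
        (i - 3, j - 3),
        (i, j - 3 * k),
        (i - 3 * k, j),
        (i - 3 * n + 1, j - 3 * n),
        (i - 3 * n, j - 3 * n + 1),
        (i - 3 * m, j - 3 * m - 1),
        (i - 3 * m - 1, j - 3 * m),
    ]
    # Keep only elements that lie inside the score matrix.
    return [(a, b) for a, b in candidates if a >= 1 and b >= 1]

print(predecessors(10, 10))
# [(7, 7), (10, 7), (7, 10), (5, 4), (4, 5), (7, 6), (6, 7)]
print(predecessors(4, 4))
# [(1, 1), (4, 1), (1, 4)]
```

With i = j = 10, the offsets reproduce the listed elements (i−3, j−3), (i, j−3), (i−3, j), (i−5, j−6), (i−6, j−5), (i−3, j−4) and (i−4, j−3).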
According to the present invention, similarities between the DNA base sequences can be compared through the translated amino acid sequences. Therefore, the comparison can be carried out in detail by listing scores reflecting not only the agreement or disagreement of amino acids but also chemical characteristics (e.g. the hydrophilicity or hydrophobicity of amino acids) and physical characteristics (e.g. the size of amino acids) in a score table used for the comparison for the similarities. Thus, the sensitivity of search for the similarities between the DNA base sequences is enhanced.
Furthermore, misses or omissions in the search can be reduced because the comparison can be carried out in view of nucleotide insertions or deletions in the DNA base sequences and amino acid insertions or deletions in the translated amino acid sequences.
The method for comparison of DNA base sequences of the present invention is summarized as follows with reference to FIG. 3. Each of a query DNA base sequence and a DNA base sequence read out of a data base is translated into an amino acid sequence (304, 306). Similarities between the translated amino acid sequences are calculated in view of nucleotide insertions or deletions and amino acid insertions or deletions, followed by score accumulation by a dynamic programming method (307). For the two translated amino acid sequences giving the top accumulated scores obtained by the similarity search, the top accumulated scores and paths are calculated by the dynamic programming method (312), and the path giving the maximum accumulated score is traced (313). The result of alignment of the translated amino acid sequences is displayed together with that of alignment of the DNA base sequences. Even if a nucleotide insertion or deletion is present in the two DNA base sequences to be compared, it becomes possible to determine similarities between the DNA base sequences through the translated amino acid sequences. Therefore, the sensitivity of search is enhanced.