Knowledge of the structures and functions of proteins obtained from genomic and post-genomic biology can now be artificially reorganized on artificial proteins and actively utilized. As a method of rationally embedding a function in an artificial protein, a small base sequence (a microgene) is first designed so as to be associated with a specific biological function; the biological function can then be reorganized on an artificial protein, which is a translation product of a microgene polymer, either by polymerizing the microgene in tandem (see Patent Document 1 and Non-Patent Document 1, for example) or by connecting plural microgenes (see Patent Document 2, for example). One feature of the microgene polymerization method (see Patent Document 1 and Non-Patent Document 1, for example) is that the different translation reading frames of a microgene are utilized in parallel. To develop high-function artificial proteins by taking advantage of this feature, it is indispensable to design and utilize a “multifunctional base sequence” in which a plurality of biological functions are embedded simultaneously in a plurality of reading frames (see Patent Document 3, for example).
To date, such a multifunctional base sequence has been designed by the following process: a given peptide sequence having a primary function is set as an initial value; the peptide sequence is back-translated, one amino acid residue at a time, into base sequences according to a genetic code table; all base sequences capable of encoding the peptide sequence are generated on a processor; the pool of peptide sequences that are encoded by the generated base sequences in reading frames different from that of the first peptide sequence is then written out on the processor; and finally, peptides having the secondary and tertiary functions are selected from this pool of peptide sequences.
In this case, base sequences in which translation termination codons emerge in other reading frames at the junctions between residues of the peptide in the first reading frame are also included in the calculation. Such base sequences have to be excluded in the end, because they cannot be applied as multifunctional genes. With the conventional algorithm described above, however, it was difficult to exclude them in advance, so all combinations had to be calculated, which required a vast amount of calculation time. For example, there are approximately 687×10^8 variants of base sequences encoding the peptide sequence NGNNGNNGNNGNNGNNGNGNNGNNGG (SEQ ID NO: 4) in the first reading frame, and only about 4×10^7 of them are devoid of translation termination codons in the second and third reading frames. In the conventional method, however, all of the approximately 687×10^8 variants had to undergo calculation.
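The figure of approximately 687×10^8 can be checked by multiplying the number of synonymous codons at each position of the peptide. The following is a minimal sketch of that arithmetic, assuming the standard genetic code (two codons for Asn, four for Gly); the names CODON_COUNT and count_encodings are illustrative and do not come from the cited documents.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code.
# Only the two residues that occur in the example peptide are listed.
CODON_COUNT = {"N": 2, "G": 4}  # Asn: AAT, AAC; Gly: GGT, GGC, GGA, GGG

PEPTIDE = "NGNNGNNGNNGNNGNNGNGNNGNNGG"  # SEQ ID NO: 4 (16 Asn, 10 Gly)

def count_encodings(peptide: str) -> int:
    """Number of base sequences encoding `peptide` in the first reading frame."""
    return prod(CODON_COUNT[aa] for aa in peptide)

total = count_encodings(PEPTIDE)
print(total)  # 68719476736 = 2**36, i.e. about 687×10^8
```

Every one of these variants must be generated and inspected by the conventional method, even though only a small fraction of them survive the exclusion of termination codons.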
Patent Document 1
Japanese Laid-Open Patent Application No. 1997-322775
Patent Document 2
Japanese Laid-Open Patent Application No. 1997-154585
Patent Document 3
Japanese Laid-Open Patent Application No. 2001-352990
Non-Patent Document 1
Proc. Natl. Acad. Sci. USA 94, 3805-3810, 1997
The object of the present invention is to provide a method of designing a multifunctional base sequence in which the calculation time is greatly shortened and the memory consumption of a processor is greatly reduced, by excluding from the calculation in advance those base sequences in which translation termination codons emerge in the second or third reading frame and which would have to be excluded in the end.
The present inventors have made a keen study to solve the problem described above and focused on the fact that a dipeptide sequence (two amino acid residues), or a longer peptide sequence, already contains information about the translation products in the second and third reading frames. The present inventors have then found the following: when a protein is analyzed and calculated as a product of connecting overlapping dipeptide sequences (two amino acid residues), or overlapping short sequences longer than dipeptides, rather than as a product of connecting 20 kinds of amino acids as in conventional methods, the analysis inherently includes the information on the translation products of the second and third reading frames. As a result, the calculation time is greatly shortened and the memory consumption of a processor can be reduced to a great extent.
FIG. 1 shows an example of the course of processing for back-translation into base sequences in units of single amino acids. For instance, there are six codons encoding leucine (Leu): TTA, TTG, CTT, CTC, CTA and CTG. There are also six codons encoding serine (Ser): TCT, TCC, TCA, TCG, AGT and AGC. To back-translate the dipeptide “Leu-Ser” into all base sequences capable of encoding it, 6×6=36 variants of base sequences are first generated on the processor. Likewise, for the sequence “Leu-Ser-Arg”, where arginine (Arg) is located at the third position, 36×6=216 variants of base sequences are generated on the processor. In this way, the number of base sequence variants generated on the processor is the product of the numbers of codons (1 to 6 variants) capable of encoding the amino acid at each position up to the Nth position, and only then does the processing move on to excluding, from among these base sequences, those containing translation termination codons (TAA, TAG, TGA) in other reading frames. Since a base sequence containing a translation termination codon in another reading frame cannot be used as a multifunctional base sequence in the end, excluding such sequences at this stage greatly reduces the burden on the later calculation processing.
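The conventional procedure described above can be sketched as follows. This is an illustrative reconstruction under the standard genetic code, not the implementation used in the cited documents; the function names back_translate and has_stop_in_other_frames are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = termination).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
STOPS = {codon for codon, aa in CODE.items() if aa == "*"}  # TAA, TAG, TGA

def codons_for(aa):
    """All codons that encode the one-letter amino acid `aa`."""
    return [codon for codon, x in CODE.items() if x == aa]

def back_translate(peptide):
    """Every base sequence encoding `peptide` in the first reading frame."""
    return ["".join(cs) for cs in product(*(codons_for(aa) for aa in peptide))]

def has_stop_in_other_frames(seq):
    """True if a termination codon appears in the second or third reading frame."""
    for offset in (1, 2):  # offset 1 = second frame, offset 2 = third frame
        for i in range(offset, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                return True
    return False

all_lsr = back_translate("LSR")  # Leu-Ser-Arg: 6 x 6 x 6 = 216 sequences
usable = [s for s in all_lsr if not has_stop_in_other_frames(s)]
print(len(all_lsr), len(usable))
```

Note that the filter runs only after every variant has been generated, which is why the cost grows with the full product of codon counts.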
Next, processing is considered in which a polypeptide sequence is regarded as being composed of units drawn from 400 kinds of dipeptides, and not as a chain of 20 kinds of amino acid residues. In a base sequence that encodes a dipeptide, the first amino acid residues of the second and third reading frames are already determined at the outset. It is therefore possible to exclude in advance the sequences containing termination codons from the pool of base sequences encoding a dipeptide. As shown in the aforementioned FIG. 1, among all 36 variants of base sequences capable of encoding the dipeptide “Leu-Ser”, eight contain a termination codon in the second reading frame and two contain a termination codon in the third reading frame. Therefore, by preparing 36−10=26 variants as the codons corresponding to “Leu-Ser”, base sequences can be generated on the processor with termination codons excluded in advance.
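A dipeptide-codon table of this kind can be precomputed once for all 400 dipeptides. The sketch below assumes the standard genetic code and stores each entry as a 6-mer string; the names DIPEPTIDE_TABLE and stop_frames are illustrative assumptions rather than terms from the source.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = termination).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
STOPS = {codon for codon, aa in CODE.items() if aa == "*"}

CODONS = {}  # amino acid -> list of its codons
for codon, aa in CODE.items():
    if aa != "*":
        CODONS.setdefault(aa, []).append(codon)

def stop_frames(hexamer):
    """Set of reading frames (2 and/or 3) in which a 6-mer has a termination codon."""
    frames = set()
    if hexamer[1:4] in STOPS:
        frames.add(2)
    if hexamer[2:5] in STOPS:
        frames.add(3)
    return frames

# Dipeptide -> all 6-mers that encode it with no stop in frames 2 and 3.
DIPEPTIDE_TABLE = {
    aa1 + aa2: [c1 + c2
                for c1 in CODONS[aa1] for c2 in CODONS[aa2]
                if not stop_frames(c1 + c2)]
    for aa1, aa2 in product(sorted(CODONS), repeat=2)
}

leu_ser_all = [c1 + c2 for c1 in CODONS["L"] for c2 in CODONS["S"]]
in_frame2 = sum(2 in stop_frames(h) for h in leu_ser_all)
in_frame3 = sum(3 in stop_frames(h) for h in leu_ser_all)
print(len(DIPEPTIDE_TABLE["LS"]), in_frame2, in_frame3)
```

For “Leu-Ser” this reproduces the counts given above: 26 usable variants, with eight excluded for the second reading frame and two for the third.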
For example, when a peptide comprising the three residues “Leu-Ser-Arg” is back-translated and the base sequences encoding it are generated on a processor, the peptide is processed as a sequence in which two dipeptides, “Leu-Ser” and “Ser-Arg”, are connected. The codons corresponding to “Leu-Ser” need be calculated only for 6×6−10=26 variants as described above, and the codons corresponding to “Ser-Arg” only for 6×6−4=32 variants (four variants contain a termination codon in the third reading frame, namely a serine codon ending in T followed by AGA or AGG, which yields TAG). Therefore, as shown in FIG. 2, every 9-mer base sequence that encodes “Leu-Ser-Arg” in the first reading frame and contains no termination codon in the second and third reading frames can be obtained by selecting and connecting, from the 26 variants of “Leu-Ser” 6-mer codons and the 32 variants of “Ser-Arg” 6-mer codons, those combinations in which serine is read by the same codon. As a result, only (6×4)+(6×6)+(6×6)+(6×6)+(1×4)+(1×6)=142 variants need to be processed and calculated, as shown in FIG. 2, whereas the codon combinations of the conventional methods required 6×6×6=216 variants of sequences to be written out on a processor.
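The connection of overlapping dipeptide entries through their shared codon can be sketched as below. This is a minimal illustration assuming the standard genetic code; allowed_hexamers, back_translate_stop_free and count_by_shared_codon are hypothetical names, and the code is not taken from the source.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = termination).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
STOPS = {codon for codon, aa in CODE.items() if aa == "*"}

CODONS = {}  # amino acid -> list of its codons
for codon, aa in CODE.items():
    if aa != "*":
        CODONS.setdefault(aa, []).append(codon)

def allowed_hexamers(aa1, aa2):
    """6-mers encoding aa1-aa2 without a stop in the second or third frame."""
    return [c1 + c2 for c1 in CODONS[aa1] for c2 in CODONS[aa2]
            if c1[1:] + c2[0] not in STOPS and c1[2] + c2[:2] not in STOPS]

def back_translate_stop_free(peptide):
    """Stop-free base sequences for a peptide of two or more residues."""
    seqs = allowed_hexamers(peptide[0], peptide[1])
    for aa1, aa2 in zip(peptide[1:], peptide[2:]):
        # Index the next dipeptide's 6-mers by their first (shared) codon.
        by_first = {}
        for hexamer in allowed_hexamers(aa1, aa2):
            by_first.setdefault(hexamer[:3], []).append(hexamer[3:])
        # Extend only where the shared residue is read by the same codon.
        seqs = [s + tail for s in seqs for tail in by_first.get(s[-3:], [])]
    return seqs

def count_by_shared_codon(tripeptide):
    """Number of stop-free 9-mers for each codon of the middle residue."""
    counts = {}
    for s in back_translate_stop_free(tripeptide):
        counts[s[3:6]] = counts.get(s[3:6], 0) + 1
    return counts

print(len(back_translate_stop_free("LSR")))
print(count_by_shared_codon("LSR"))
```

Grouping the result by the serine codon gives the six terms of the sum above: 24 for TCT, 36 each for TCC, TCA and TCG, 4 for AGT and 6 for AGC.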
As described in the foregoing, the processing of sequences that would finally be excluded due to the emergence of termination codons can be avoided by processing a polypeptide sequence as a pool of dipeptide units, preferably as a pool of sequential dipeptide units that overlap by one amino acid residue, and by preparing a dipeptide-codon corresponding table (a corresponding table of the nucleic acid sequences encoding each dipeptide) from which the codons having termination codons in the second or third reading frame have been excluded in advance. In fact, the use of such an algorithm greatly shortens the calculation time, as described later. Furthermore, it also reduces the necessary memory size to a great extent.
In addition, when a dipeptide-codon table from which termination codons have been excluded in advance is translated in three reading frames, the kinds of first amino acids in the second and third reading frames turn out to be determined at the outset, as FIG. 3 indicates. For example, in the sequence TTATCT for “Leu-Ser”, TTA in the first reading frame encodes leucine (L), but it is already determined that the first amino acid in the second reading frame is tyrosine (Y), encoded by TAT, and that the first amino acid in the third reading frame is isoleucine (I), encoded by ATC. Therefore, once a dipeptide is given, the possible kinds of amino acids in the second and third reading frames at that position are determined without back-translating to base sequences each time. Calculation processing can be reduced considerably by preparing in advance a “corresponding table for amino acids for each dipeptide-reading frame” so as to avoid the back-translation to base sequences. In this case, however, the information needed to connect the first and second dipeptides, as found in FIG. 2, is not included, and thus some extra information is needed to determine the possible “combinations”. Nevertheless, starting from a given peptide sequence in the first reading frame, the table yields sufficient information to find out the kinds of amino acids that can emerge in the second and third reading frames and to learn their approximate proportions.
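A “corresponding table for amino acids for each dipeptide-reading frame” of this kind might be tabulated as follows. The sketch assumes the standard genetic code and counts how often each amino acid arises in the second and third reading frames over the stop-free codon pairs of a dipeptide; frame_amino_acids is a hypothetical name, and the layout of FIG. 3 itself is not reproduced here.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = termination).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

CODONS = {}  # amino acid -> list of its codons
for codon, aa in CODE.items():
    if aa != "*":
        CODONS.setdefault(aa, []).append(codon)

def frame_amino_acids(aa1, aa2):
    """Counts of amino acids in frames 2 and 3 over stop-free codon pairs."""
    frame2, frame3 = Counter(), Counter()
    for c1 in CODONS[aa1]:
        for c2 in CODONS[aa2]:
            f2 = CODE[c1[1:] + c2[0]]   # second reading frame
            f3 = CODE[c1[2] + c2[:2]]   # third reading frame
            if "*" in (f2, f3):
                continue                # excluded in advance
            frame2[f2] += 1
            frame3[f3] += 1
    return frame2, frame3

# The single 6-mer TTATCT from the text: Tyr in frame 2, Ile in frame 3.
print(CODE["TTATCT"[1:4]], CODE["TTATCT"[2:5]])

frame2, frame3 = frame_amino_acids("L", "S")
print(dict(frame2), dict(frame3))
```

The counts give the approximate proportions mentioned above, but because the codon identities are discarded, two adjacent entries cannot be checked for a shared codon from this table alone.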
Information on the amino acid combinations that can emerge in the second and third reading frames can also be obtained by further adding, for instance, information on the kinds of codons used to the aforementioned “corresponding table for amino acids for each dipeptide-reading frame”. This is in substance the same as the back-translation processing to base sequences demonstrated in FIG. 2, yet it is characterized in that memory consumption can be reduced and in that processing can be performed with other information embedded, such as the usage frequency of codons.
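One way to picture such an augmented table is to keep, for each stop-free codon pair, the two codons together with the amino acids they produce in the second and third reading frames, and then to chain records whose codons match. The sketch below is an assumption-laden illustration under the standard genetic code; records and frame_peptides are hypothetical names, and codon usage frequencies are deliberately left out because the source gives no values for them.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = termination).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

CODONS = {}  # amino acid -> list of its codons
for codon, aa in CODE.items():
    if aa != "*":
        CODONS.setdefault(aa, []).append(codon)

def records(aa1, aa2):
    """(codon1, codon2, frame-2 aa, frame-3 aa) for each stop-free codon pair."""
    out = []
    for c1 in CODONS[aa1]:
        for c2 in CODONS[aa2]:
            f2, f3 = CODE[c1[1:] + c2[0]], CODE[c1[2] + c2[:2]]
            if "*" not in (f2, f3):
                out.append((c1, c2, f2, f3))
    return out

def frame_peptides(peptide):
    """(frame-2 peptide, frame-3 peptide) for each stop-free encoding."""
    # State: (last codon used, frame-2 peptide so far, frame-3 peptide so far).
    states = [(c2, f2, f3) for _, c2, f2, f3 in records(peptide[0], peptide[1])]
    for aa1, aa2 in zip(peptide[1:], peptide[2:]):
        recs = records(aa1, aa2)
        states = [(c2, p2 + f2, p3 + f3)
                  for last, p2, p3 in states
                  for c1, c2, f2, f3 in recs
                  if c1 == last]  # the shared residue must use the same codon
    return [(p2, p3) for _, p2, p3 in states]

combos = frame_peptides("LSR")
print(len(combos), len(set(combos)))
```

For example, the encoding TTA-TCT-CGT of “Leu-Ser-Arg” contributes the pair ("YL", "IS"): Tyr-Leu in the second reading frame and Ile-Ser in the third.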
The present invention has been completed based on the findings described above.