The present invention generally relates to a computer software program to design an optimum oligo-nucleic acid sequence candidate from nucleic acid base sequences being analyzed, and a method thereof.
To analyze the expression in cells of a gene that is the object of an experiment, an element called a DNA chip is generally used. A DNA chip is made by arranging, on a glass or silicon substrate, DNA fragments and/or RNA fragments carrying thousands to tens of thousands of different pieces of base sequence information.
The nucleic acid sequence of each of the DNA fragments and/or RNA fragments arranged on a DNA chip is called a capture, and is chosen so that binding, i.e., hybridization, will occur with the specific gene that is the object of the experiment. With this type of DNA chip, for instance, when a healthy cell has turned into a diseased cell, it is possible to find the expressed gene causing the illness by examining which genes in this cell hybridize.
Here, the nucleic acid sequence of the aforementioned DNA fragments used as a capture is generally selected from a library. A library is an aggregate of DNA samples or an aggregate of cDNA samples prepared by cloning fragments of genes obtained from a cell or the like. Here, cDNA (complementary DNA) means a DNA that is synthesized complementary to the messenger RNA, i.e., a DNA whose base sequence can pair with all bases of the messenger RNA.
However, it is difficult in terms of time, cost and technology for researchers to obtain actual samples to serve as captures, as they would have to obtain existing DNA fragments from cells. Therefore, researchers have recently begun using a method wherein an oligo-base sequence approximately several tens of bases long is determined from known sequence information, and is then chemically synthesized and mounted on a substrate. The sequence information used is either that of a genome whose sequence has already been read out, or that called EST (Expressed Sequence Tag), which identifies the poly A terminal side of the messenger RNA (poly A is the sequence -AAAAOH present at the RNA terminal). Here, an oligo-nucleic acid means a nucleic acid having a relatively short base sequence (e.g., approximately 200 base pairs).
In the past, to determine an appropriate oligo-nucleic acid sequence, researchers partially extracted genes in a library or the gene that was the object of an experiment, compared these sequences through visual observation, and searched for the similarities and differences present in the sequences. In recent years, however, DNA chips and DNA arrays have reached higher levels of integration, meaning that more nucleic acid fragments are mounted on a single substrate, and searches through visual observation are no longer realistic. Thus, computers are more commonly used to determine the base sequences of the nucleic acid fragments arranged on a substrate.
As a technology to realize this method, an oligo-probe design station has conventionally been available, for instance as disclosed in Patent No. WO 94/11837, which can design common probes and specific probes through computer processing using the data in gene sequence data sources.
However, this type of existing computer-processing technology simply computes and presents a hybridization strength model, from which the user selects an appropriate probe. It cannot improve the accuracy of the bond temperature (Tm) of the probes.
That is, when many different probes are designed for DNA chips or for other purposes, all of these probes must exhibit the same Tm. The Tm value is the temperature at which 50 percent of the strands of the double helix are hydrogen bonded. It is determined by the GC content and other parameters. However, the GC content varies according to the base sequence and its length. It is therefore very difficult to determine a sequence that is specific, has the base length fixed as the synthesis condition, and also yields the appropriate temperature condition.
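The dependence of Tm on GC content described above can be illustrated with a minimal sketch. It assumes the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, a common rough estimate for short oligonucleotides; this rule is an assumption chosen for illustration and is not stated in this document. The function names `gc_content` and `wallace_tm` are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq, between 0.0 and 1.0."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def wallace_tm(seq: str) -> float:
    """Rough Tm estimate in degrees C using the Wallace rule (an assumption,
    not the method of this document): 2 per A/T base plus 4 per G/C base."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2.0 * at + 4.0 * gc


if __name__ == "__main__":
    # Three hypothetical 20-base oligos with low, medium and high GC content.
    for oligo in ("ATATATATATATATATATAT",
                  "ATGCATGCATGCATGCATGC",
                  "GCGCGCGCGCGCGCGCGCGC"):
        print(oligo, f"GC={gc_content(oligo):.2f}", f"Tm~{wallace_tm(oligo):.0f}")
```

Under this estimate, three oligos of the same 20-base length range from 40 to 80 degrees, which shows why fixing the length alone does not fix the Tm.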
In the technology disclosed in the aforementioned Patent No. WO 94/11837, the strength of hybridization between oligo-nucleic acid sequence candidates and the specific gene is computed on the basis of the Tm and presented to the user, so that the user can easily select the probe that realizes the optimum temperature condition. However, because the oligo-nucleic acid sequence candidates used in this technology are initially designed without considering the Tm condition, this process does not minimize the differences among the Tm values of the candidates. Thus, when many probes are to be obtained from the candidates, the spread of Tm values becomes significantly large. According to the analyses made by the inventors, the error range of Tm of the oligo-nucleic acid base sequences obtained in the prior art is as large as ±20 degrees. Conversely, if this error range is narrowed, only an insufficient number of oligo-nucleic acid base sequences can be obtained.
Meanwhile, another application that requires determination of oligo-nucleic acid sequences is the design of probes that serve as a gene amplification means in the PCR (Polymerase Chain Reaction) method, among others. In the PCR method, to search for a specific base sequence part and to amplify that particular part, suitable probe base sequences several tens of bases long must be designed for the initial positions at both ends of the sequence to be amplified. As in the case of designing the base sequence for a capture, specific sequence primers must be designed so as not to form hydrogen bonds outside the part that is to be amplified. Further, the Tm must also be kept consistent with a common hybridization temperature condition.
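Both applications above need candidates whose Tm values cluster tightly, and the trade-off between a narrow Tm range and the number of usable candidates can be sketched as follows. This is only an illustration under stated assumptions: Tm is estimated with the Wallace rule (not taken from this document), candidates are all fixed-length windows of a target sequence, and the names `wallace_tm` and `candidates_within_tolerance` are hypothetical.

```python
def wallace_tm(seq: str) -> float:
    """Rough Tm estimate in degrees C (Wallace rule, an assumption)."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2.0 * at + 4.0 * gc


def candidates_within_tolerance(target: str, length: int,
                                tm_goal: float, tol: float):
    """Slide a window of the given length over target and keep each window
    whose estimated Tm lies within tol degrees of tm_goal.

    Returns a list of (start_index, subsequence, tm) tuples.
    """
    hits = []
    for start in range(len(target) - length + 1):
        window = target[start:start + length]
        tm = wallace_tm(window)
        if abs(tm - tm_goal) <= tol:
            hits.append((start, window, tm))
    return hits


if __name__ == "__main__":
    target = "AAAAGGGG"  # hypothetical toy sequence
    for tol in (0.0, 2.0, 4.0):
        hits = candidates_within_tolerance(target, 4, 12.0, tol)
        print(f"tolerance {tol}: {len(hits)} candidate(s)")
```

Tightening the tolerance shrinks the candidate list, which is the difficulty the text describes when candidates are generated without regard to Tm.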
For the aforementioned purpose, the designed probe must be a specific sequence that amplifies only the desired part of the applicable gene or of the intermingled nucleic acids. Further, in some cases, a plurality of sequences may need to be amplified concurrently. In such a case, each probe sequence must be appropriate for the part it is to amplify, and the amplification must be performed under similar Tm conditions.
A technology related to probe designs in this PCR method has been previously disclosed. However, because of the aforementioned reasons, this technology does not offer a solution that provides the appropriate Tm condition.
Further, computer processing allows efficient determination of the sequence parts that are specific only to the nucleic acid base sequence being analyzed, through concurrent inter-comparisons among a large number of nucleic acid base sequences. However, if the nucleic acid base sequences being compared include a sequence identical to the nucleic acid base sequence for which an oligo-probe is being designed, it is impossible to determine the specific parts, and thus to design the probe. In such a case, the aforementioned comparison must be repeated after determining and removing this duplicate base sequence. This not only takes time and trouble, but also places a load on the computer that slows down the whole processing.
As mentioned above, the prior art had the problem that it was very difficult to determine a sequence that is specific, has the base length fixed as the synthesis condition, and also meets the appropriate temperature condition, all at the same time.
Further, according to the prior art, when a sequence identical to the nucleic acid base sequence being analyzed is registered in duplicate among the nucleic acid base sequences being compared (i.e., has been duplicated in the database), it is impossible to design an oligo-nucleic acid base sequence candidate. Thus, it was necessary to repeat the homology comparison after deleting the duplicate sequence registration.
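The duplicate-sequence problem described above can be sketched as follows. The sketch makes simplifying assumptions that are not from this document: the compared sequences are held in a dictionary keyed by a hypothetical entry name, duplicates are detected by exact string equality, and exact substring matching stands in for a real homology comparison. The function names are hypothetical.

```python
def split_duplicates(query: str, references: dict):
    """Separate reference entries identical to the query from the rest.

    Returns (names_of_duplicates, remaining_references).
    """
    q = query.upper()
    duplicates = [name for name, seq in references.items() if seq.upper() == q]
    others = {name: seq for name, seq in references.items() if seq.upper() != q}
    return duplicates, others


def is_specific(candidate: str, references: dict) -> bool:
    """True if candidate does not occur in any reference sequence.

    Exact substring matching is a stand-in for homology comparison.
    """
    c = candidate.upper()
    return all(c not in seq.upper() for seq in references.values())


if __name__ == "__main__":
    query = "ATGCCGTA"
    refs = {"geneA": "ATGCCGTA", "geneA_copy": "atgccgta", "geneB": "TTTTGGGG"}
    dups, others = split_duplicates(query, refs)
    print("duplicates:", dups)
    # With the duplicates left in, no part of the query looks specific.
    print("specific vs. all refs:", is_specific("GCCG", refs))
    print("specific vs. de-duplicated refs:", is_specific("GCCG", others))
```

With the duplicate entries present, every subsequence of the query matches a compared sequence, so no specific part can be found until the duplicates are set aside.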
This invention was made in consideration of this situation. Its object is to offer a system and a method that can concurrently determine many oligo-nucleic acid sequences with a high level of accuracy in the values of Tm (double-chain bond temperature), GC content and base sequence length.
A more detailed object of this invention is to offer a system and a method wherein, when oligo-nucleic acid sequences are determined, the desired tolerated design range and the priority items can be specified, and the oligo-nucleic acid sequences that meet these conditions are determined and displayed.
Another detailed object of this invention is to offer a system and a method wherein oligo-nucleic acid base sequences can be determined without repeating the homology comparison from the beginning, even when a sequence identical to the nucleic acid base sequence being analyzed is registered in duplicate among the nucleic acid base sequences being compared.
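One way to picture a tolerated design range combined with a priority item is the following minimal sketch. It is not the system of this invention: the `DesignRange` structure, the `select` function, the Wallace-rule Tm estimate and the ranking by distance from the centre of the prioritized range are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class DesignRange:
    """Hypothetical tolerated ranges for Tm (deg C), GC fraction and length."""
    tm_min: float
    tm_max: float
    gc_min: float
    gc_max: float
    len_min: int
    len_max: int


def properties(seq: str) -> dict:
    """Estimated Tm (Wallace rule, an assumption), GC fraction and length."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    return {"tm": 2.0 * at + 4.0 * gc, "gc": gc / len(s), "length": len(s)}


def select(candidates, rng: DesignRange, priority: str = "tm"):
    """Keep candidates inside every tolerated range, then order them by how
    close the prioritized property is to the centre of its range."""
    centre = {
        "tm": (rng.tm_min + rng.tm_max) / 2,
        "gc": (rng.gc_min + rng.gc_max) / 2,
        "length": (rng.len_min + rng.len_max) / 2,
    }
    accepted = []
    for seq in candidates:
        p = properties(seq)
        if (rng.tm_min <= p["tm"] <= rng.tm_max
                and rng.gc_min <= p["gc"] <= rng.gc_max
                and rng.len_min <= p["length"] <= rng.len_max):
            accepted.append((abs(p[priority] - centre[priority]), seq))
    accepted.sort(key=lambda pair: pair[0])
    return [seq for _, seq in accepted]


if __name__ == "__main__":
    cands = ["ATGCATGCATGCATGCATGC", "ATGCATGCATGCATGCATAT",
             "GCGCGCGCGCGCGCGCGCGC", "ATGC"]
    rng = DesignRange(55.0, 63.0, 0.35, 0.50, 18, 22)
    print(select(cands, rng, priority="tm"))
    print(select(cands, rng, priority="gc"))
```

Changing the priority item reorders the same accepted set, while the tolerated ranges decide which candidates are accepted at all.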