In recent years an enormous number of nucleic-acid sequences (herein referred to as “base sequences”) and amino-acid sequences of proteins have become known, and databases of them are being prepared throughout the world. In most cases, a newly discovered sequence is assigned an identifier consisting of 6 to 10 letters and digits and is stored in a database together with a string of characters representing the sequence. However, because analysts and database-preparing organizations often assign identifiers routinely or arbitrarily, without regard to the sequences themselves, it often happens that different identifiers are assigned to the same sequence and that the same identifier is assigned to different sequences. Accordingly, the conventional identifiers cannot be reliably used to judge whether the same sequence, or information related to the same sequence, exists in a database; instead, it is necessary to compare several hundred to several thousand residues for each of the huge number of known sequences in the database.
Base sequences and amino-acid sequences are information equivalent to chemical formulae, which specify the structures of substances such as DNA, RNA, peptides and proteins (FIG. 1 ①). In general, a sequence is information on the kinds and connection order of the bases or amino acids (herein referred to as “residues”) comprised in those substances. Generally, one sequence specifies one substance; however, there are cases in which a sequence specifies multiple substances, for example when a residue symbol such as “purine”, meaning either adenine or guanine, is used to specify the kind of a residue.
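The case in which one sequence specifies multiple substances can be illustrated with a minimal sketch. It assumes the IUPAC one-letter ambiguity codes R (purine: adenine or guanine) and Y (pyrimidine: cytosine or thymine); the table is only a subset, and the function name `expand` is hypothetical and introduced here for illustration.

```python
from itertools import product

# Subset of the IUPAC nucleotide codes (an assumption of this sketch):
# R stands for a purine (A or G), Y for a pyrimidine (C or T).
AMBIGUITY = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT"}


def expand(sequence):
    """Return every concrete base sequence that `sequence` can specify."""
    choices = [AMBIGUITY[base] for base in sequence]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    # "ARC" specifies two substances: one with adenine and one with
    # guanine at the second position.
    print(expand("ARC"))
```

A sequence with n two-way ambiguous residues specifies 2^n substances in this way, which is why the remainder of the discussion treats the string of characters, not a single substance, as the thing to be identified.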
A base sequence (or an amino-acid sequence) is usually represented by a string of characters. Usually, each residue is represented by a one-letter or three-letter code as a description unit, and a string of such codes arranged in the connection order of the residues represents the sequence. However, depending on the notation, the same sequence can be represented by different strings of characters. Herein, a string of characters representing a sequence is “data representing connection order of residues in the sequence”, which is one of the possible representations of the order of residues in the sequence. For example, an amino-acid sequence in which alanine, leucine and glycine are connected in this order can be represented by “AlaLeuGly” in three-letter notation as shown in FIG. 1 ②, or by “ALG” in one-letter notation as shown in FIG. 1 ③. These strings of characters are different representations (different as data items) of the same sequence.
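The relation between the two notations can be sketched as a simple table lookup. The following is an illustration only, assuming the standard one-letter and three-letter codes of the 20 common amino acids and an input written without separators; the function name `three_to_one` is hypothetical.

```python
# Standard three-letter to one-letter codes of the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(text):
    """Convert a three-letter string such as 'AlaLeuGly' to 'ALG'."""
    if len(text) % 3 != 0:
        raise ValueError("length is not a multiple of three")
    # Read one three-letter description unit at a time, in connection order.
    units = [text[i:i + 3] for i in range(0, len(text), 3)]
    return "".join(THREE_TO_ONE[unit] for unit in units)


if __name__ == "__main__":
    print(three_to_one("AlaLeuGly"))  # prints ALG
```

Both strings are different data items, yet after conversion to one notation they compare equal, which is the property relied upon when sequences are compared.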
In organisms there exists a huge variety of substances that can be specified by base sequences and amino-acid sequences. Strings of characters representing the sequences, and information related to the sequences, are stored in databases.
If a substance is available, the connection order of its residues can be determined with analytical instruments such as sequencers. Consequently, a base sequence or an amino-acid sequence is determined and represented as a string of characters, regardless of the analyst or the analytical site. Identity of sequences can be judged by comparing strings of characters that have been transformed to a standard representation. Usually, strings of characters representing sequences are included in data records in databases. Whether or not different data records contain the same sequence is ultimately judged by comparing the standard representations of the sequences in the data records.
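The comparison of standard representations can be sketched as follows. The text does not define the standard representation at this point, so the rule used here (one-letter codes, upper case, with spaces, line breaks and position numbers removed) is an assumption made for illustration, and the names `to_standard` and `same_sequence` are hypothetical.

```python
def to_standard(text):
    """Reduce a formatted sequence string to an assumed standard form.

    Assumption of this sketch: the standard form is the one-letter
    notation in upper case, keeping letters only, so that spacing,
    line breaks and position numbers do not affect the comparison.
    """
    return "".join(ch for ch in text if ch.isalpha()).upper()


def same_sequence(record_a, record_b):
    """Judge identity of two sequences by their standard representations."""
    return to_standard(record_a) == to_standard(record_b)


if __name__ == "__main__":
    # The same base sequence as it might be laid out in two data records.
    a = "1 atgcgtta ccgt"
    b = "ATGCGTTACCGT"
    print(same_sequence(a, b))
```

The raw strings differ as data items, but they are judged identical once both are brought to the same standard representation.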
Data records containing sequences are available to anyone via the Internet from GenBank, EMBL, DDBJ, SWISS-PROT and others, and many published patents and documents also contain sequences. In a data record, in addition to the string of characters representing the sequence, information related to the sequence is filed, such as the organism of origin, the definition of segments within the sequence and the features of those segments; here, a “file” means a form of data record. Identifiers that ought to be assigned uniquely to sequences tend instead to be used as identifiers for the entire information in the files. This is due to the lack of a procedure for assigning specific identifiers to sequences. Herein, “unique” means one-to-one correspondence. “Specific identifiers” are unique and consistent identifiers. “Consistent” means that the identifier of the same sequence must be the same in all databases. It is always easy to assign unique identifiers to sequences in each database independently, but it is difficult to assign the same identifier to the same sequence in all databases.
It is often the case that different data records are found to contain the same sequence. For example, the data records may differ only in information related to the sequence, such as the organism in which the sequence was found. Biologically, this means that the same sequence was found in different organisms; the differing information is therefore recorded in separate data records on purpose. However, since one of the identifiers assigned to these data records is often arbitrarily used as the identifier of the sequence, specific identifiers of sequences are necessary.
There are many data records to which identifiers of clones are assigned. For example, the identifier of the clone of a cDNA library in which a base sequence was found is assigned to the data record containing that base sequence. It is often the case that the base sequence is later redetermined from the clone. In this case, the former sequence recorded in the data record is revised to the redetermined sequence, which often differs from the former. That is, the sequence corresponding to the same identifier changes before and after the revision of the data record. Since this kind of revision is performed often, it is troublesome to use those identifiers as reference keys for describing information related to the sequence. A “reference key” means a name or a key which specifies the sequence. Under ordinary circumstances, specific identifiers play the same role as reference keys. Therefore, specific identifiers of sequences are necessary.
Since the methods of assigning identifiers differ from database to database, it is not possible to judge the identity of sequences, or of a segment or segments of sequences, based only on a comparison of their identifiers. Therefore, the only ways to judge whether or not sequences contained in data records in different databases are the same are either to compare the strings of characters representing the sequences or to depend on link information indicating relations among identical sequences. Considering that more data records containing sequences will be registered in independent databases in the future, it is desirable to establish a method of generating identifiers based on data which uniquely specify the sequences, and to use it uniformly in all databases so as to maintain the consistency of identifiers among all databases.
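One way such an identifier could be derived from the sequence data alone is sketched below. This is an illustration, not the method of this specification: the choice of SHA-256 as the digest, the normalization rule (letters only, upper case) and the function name `sequence_identifier` are all assumptions introduced here.

```python
import hashlib


def sequence_identifier(text):
    """Derive an identifier from the sequence itself (illustrative sketch).

    Assumptions: the sequence is first reduced to upper-case letters
    only, and SHA-256 is used as the digest. Neither choice is taken
    from the specification.
    """
    standard = "".join(ch for ch in text if ch.isalpha()).upper()
    return hashlib.sha256(standard.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Two independent databases holding the same sequence in different
    # layouts obtain the same identifier without consulting each other.
    id_a = sequence_identifier("atgcgtta ccgt")
    id_b = sequence_identifier("ATGCGTTACCGT")
    print(id_a == id_b)
```

Because the identifier depends only on the sequence, a revised sequence receives a different identifier rather than silently inheriting the old one. A digest of fixed length cannot guarantee strict one-to-one correspondence, so a collision between different sequences, although extremely improbable, would still have to be resolved by comparing the strings of characters themselves.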