The sequences of nucleotides (or bases) in the paired polymer strands that constitute the DNA molecules of humans and other organisms (animals, plants, microorganisms, etc.) are being deciphered worldwide. To record the deciphered nucleotide sequences, the four kinds of nucleotides that constitute DNA are expressed as four different one-byte (eight-bit) text characters by allocating the character A, G, C, or T to the nucleotide that includes adenine, guanine, cytosine, or thymine, respectively, as its nitrogenous base. Consequently, sequence information on DNA that consists of two polymer strands, each comprising n nucleotides (n is an integer), is represented as n bytes of text data by expressing each nucleotide of one strand, one by one, as the corresponding character selected from the four characters A, G, C, and T (or a, g, c, and t). Similarly, the sequence of n nucleotides constituting an RNA molecule is recorded as n bytes of text data by allocating the character A, G, C, or U (or a, g, c, or u) to the nucleotide that includes adenine, guanine, cytosine, or uracil, respectively.
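As an illustration of the conventional text representation described above, the following minimal Python sketch encodes one strand as one byte per nucleotide. The function name `encode_strand` and the choice of ASCII as the one-byte encoding are assumptions made for this example only.

```python
def encode_strand(strand: str, rna: bool = False) -> bytes:
    """Encode one strand as conventional text data, one byte per nucleotide.

    Hypothetical helper: DNA uses A, G, C, T and RNA uses A, G, C, U
    (upper or lower case), as in the conventional notation.
    """
    letters = "AGCU" if rna else "AGCT"
    allowed = set(letters) | set(letters.lower())
    unexpected = set(strand) - allowed
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    # ASCII is assumed here, so each character occupies exactly one byte.
    return strand.encode("ascii")


dna_text = encode_strand("GATTACA")
rna_text = encode_strand("gauuaca", rna=True)
print(len(dna_text), len(rna_text))  # n nucleotides occupy n bytes
```

Only one strand is stored in this sketch, which is why a molecule with n nucleotides per strand needs n bytes rather than 2n.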
In the case of humans, each strand of the DNA molecule in the first (largest) chromosome is a sequence of nearly 250,000,000 nucleotides, and each strand in the 22nd (smallest) chromosome is a sequence of nearly 50,000,000 nucleotides, so the nucleotide sequence of the DNA in each chromosome can be expressed as about 50 MB to 250 MB of text data. In addition, since the human genome (all DNA information) is expressed as a sequence of nearly 3,000,000,000 nucleotides, it is recorded as about 3 GB of text data. For practical use, the original text data may be recorded or transmitted as a compressed file of about half the size of the original data by applying conventional file compression techniques.
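The size estimates above follow directly from the one-byte-per-nucleotide convention. The short Python sketch below reproduces the arithmetic; the helper name `text_size_mb`, the use of decimal megabytes (1 MB = 1,000,000 bytes), and the fixed compression ratio of one half are assumptions chosen to match the approximate figures in the text.

```python
BYTES_PER_NUCLEOTIDE = 1      # one text character per nucleotide
BYTES_PER_MB = 1_000_000      # assumption: decimal megabytes
COMPRESSION_RATIO = 0.5       # "about half the size", taken from the text


def text_size_mb(nucleotides: int) -> float:
    """Size in MB of the text data for a sequence of the given length."""
    return nucleotides * BYTES_PER_NUCLEOTIDE / BYTES_PER_MB


sizes = {
    "chromosome 1": text_size_mb(250_000_000),
    "chromosome 22": text_size_mb(50_000_000),
    "whole genome": text_size_mb(3_000_000_000),
}

for name, mb in sizes.items():
    print(f"{name}: about {mb:.0f} MB as text, "
          f"about {mb * COMPRESSION_RATIO:.0f} MB compressed")
```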
Following the decipherment of the nucleotide sequences of DNA, the functions of the proteins synthesized according to the genes in DNA are being widely researched. In this research, each of the 20 kinds of amino acids constituting protein molecules is expressed as text data of three characters (for example, Ala, Cys, Glu, etc.) in the Three-Letter Code or of one character (for example, A, C, E, etc.) in the One-Letter Code, so the sequence of a protein molecule that consists of n amino acids is represented as n bytes of text data in the One-Letter Code. As ordinary proteins consist of sequences of roughly 20 to 1000 amino acids, the sequence of each of those proteins may be recorded in about 1 KB of text data at most. Moreover, it is estimated that there are nearly 30,000 human genes in total and that there may be nearly 100,000 kinds of protein molecules, including theoretical ones.
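To make the two notations concrete, the following Python sketch converts a sequence written in the Three-Letter Code into the One-Letter Code and compares the resulting text sizes. The table lists the standard codes for the 20 amino acids; the names `THREE_TO_ONE` and `to_one_letter` are hypothetical and used only for this illustration.

```python
# Standard three-letter to one-letter codes for the 20 amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues: list) -> str:
    """Convert a list of three-letter residue names to a one-letter string."""
    return "".join(THREE_TO_ONE[name] for name in residues)


peptide = ["Ala", "Cys", "Glu"]
one_letter_text = to_one_letter(peptide)   # one byte per amino acid
three_letter_text = "".join(peptide)       # three bytes per amino acid
print(one_letter_text, len(one_letter_text), len(three_letter_text))
```

For a protein of n amino acids, the One-Letter Code therefore needs n bytes and the Three-Letter Code needs 3n bytes.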
As described above, about 3 GB of memory is necessary to record the human genome in the form of text data. Even if conventional file compression techniques are employed, nearly 1 GB of memory may be needed. Recently, DNA sequences of living organisms other than humans, such as colon bacilli and various viruses, have also been disclosed to the public. If these DNA sequences are collected as text data, several hundred MB of memory may be needed for each of those organisms. The same applies to recording sequence information on RNA.
Thus, when information on the DNA sequences of humans or other organisms is recorded in the form of text files or conventional compressed files, a recording medium with a very large capacity, such as a DVD-ROM disk capable of recording nearly 5 GB of data, is necessary. There is the additional inconvenience that both the time needed to read sequence information from the recording medium and the time needed to process it are long.
Moreover, since the transmission rate of the current general communications network is about 5 Mbps, transmitting DNA sequence information of, for example, 1 GB via the communications network takes around thirty minutes. In recent years, the digital cellular phone system has also become widespread as a communications medium. It may, however, be difficult to use it to transmit DNA sequence information, at least that of humans, since the transmission rate of the present cellular phone system is as low as nearly 1 Mbps.
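The transmission times mentioned above can be checked with a short calculation. The Python sketch below is only an estimate: it assumes decimal units (1 GB = 1,000,000,000 bytes, 1 Mbps = 1,000,000 bits per second) and ignores protocol overhead, and the function name `transmission_minutes` is hypothetical.

```python
def transmission_minutes(size_bytes: int, rate_bits_per_second: float) -> float:
    """Idealized transfer time in minutes, ignoring protocol overhead."""
    return size_bytes * 8 / rate_bits_per_second / 60


ONE_GB = 1_000_000_000  # assumption: decimal gigabyte

general_network = transmission_minutes(ONE_GB, 5_000_000)  # about 5 Mbps
cellular_phone = transmission_minutes(ONE_GB, 1_000_000)   # about 1 Mbps

print(f"general network: {general_network:.1f} min")  # 26.7 min
print(f"cellular phone: {cellular_phone:.1f} min")    # 133.3 min
```

Under these assumptions, 1 GB takes roughly half an hour at 5 Mbps and more than two hours at 1 Mbps.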
There is also the problem of how to assure that nucleotide sequences that are held by two or more researchers as a standard sequence, and are assumed to be equal, are really equal. This problem arises, for example, when the genes in the DNA of a certain microorganism are studied by several researchers. That is, it is not necessarily easy for two or more researchers to mutually verify in a short time that their text data expressing the nucleotide sequence of the DNA are completely equal when each set of text data amounts to several MB (several million characters).
In this connection, one use of information on the DNA sequences of humans or other organisms is the task of searching for the differences between a standard DNA sequence and a sample DNA sequence. Such a task is needed, for example, when an SNP (Single Nucleotide Polymorphism) is searched for. However, there is the inconvenience that a relatively long time is needed to compare the two sets of text data and search for the differences between them when both represent very long nucleotide sequences of DNA.
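The conventional comparison discussed above amounts to scanning both sets of text data character by character. The following Python sketch shows that baseline only, not the method of the invention; the function name `find_differences` is hypothetical, and the sketch assumes both sequences have the same length, so it detects substitutions such as SNPs but not insertions or deletions.

```python
from typing import List, Tuple


def find_differences(standard: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, standard_base, sample_base) for each mismatch.

    Assumes equal-length sequences; positions are zero-based.
    """
    if len(standard) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (position, expected, observed)
        for position, (expected, observed) in enumerate(zip(standard, sample))
        if expected != observed
    ]


standard_sequence = "GATTACAGATTACA"
sample_sequence = "GATTACAGCTTACA"
differences = find_differences(standard_sequence, sample_sequence)
print(differences)  # [(8, 'A', 'C')]
```

Every character of both sequences must be read, so the running time grows in proportion to the sequence length, which is the cost the passage above refers to.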
Furthermore, a new business has started in which several suppliers offer many pieces of information on DNA sequences to users such as researchers at pharmaceutical companies. In this business, it is preferable for the suppliers to avoid offering overlapping information to the users. It is thus convenient for the users to be able to check easily whether the nucleotide sequences of DNA offered by multiple suppliers are equal, without the entire information on the nucleotide sequences being disclosed to the public. In addition, when the suppliers offer the DNA information to the users through, for example, a communications network, a business model is needed in which the necessary information can be transmitted to the users in as little data as possible so as to shorten the transmission time. Moreover, it is preferable that the users can easily check whether the offered DNA information contains transmission errors or the like. The same problems arise in handling information on the nucleotide sequences of RNA.
In addition, the amino acid sequence of a protein is recorded as text data of about 1 KB at most, and there are about 100,000 kinds of proteins, including theoretical ones. Thus, if sequence information on all kinds of proteins is expressed in the form of text data, the total amount of data is large. Accordingly, it is preferable to record the sequence of each protein in as little data as possible, and a system is needed by which it can easily be verified whether two pieces of sequence information on proteins are equal.
It is therefore an object of the present invention to provide a method and device for recording approximate sequence information on biological compounds, such as a set of nucleotides of nucleic acids or a set of amino acids of proteins, in a small amount of data.
It is a second object of the invention to provide a method and device for detecting the difference between two pieces of sequence information on biological compounds using a small amount of data and, if necessary, recovering the difference.
It is a third object of the invention to provide a business model (or a method for supplying information) that, when sequence information on biological compounds such as a set of nucleotides or a set of amino acids is supplied to a user, enables the user to easily verify whether the user's data and the supplier's original data are equal and to detect any difference between them using a small amount of data.
It is a fourth object of the invention to provide a computer-readable medium in which approximate information on sequences of biological compounds is recorded with a small amount of data.