As the field of biology has developed, genetic research has advanced in areas such as human health, pharmaceutical research and development, new plant and animal species, and microorganisms.
By sequencing a gene of an organism, it is possible to obtain the sequence of base pairs composing the chromosome of the organism. Usually, measuring the gene sequence of the first sample of a species is referred to as sequencing, while measuring the gene sequence of other samples of the same species is referred to as re-sequencing. Breakthroughs have been achieved in sequencing and re-sequencing technologies. As the associated costs continue to fall, more and more individuals and organizations have come to recognize the significance of gene sequences, and gene sequence data for a large number of species have so far been obtained through sequencing and re-sequencing.
Gene sequences contain a large amount of data. As an example, a human gene sequence includes about 3 billion base pairs, or 6 billion individual characters (i.e., A, G, T, and C), according to existing representation modes. Each gene sequence stored in a gene database therefore takes up a large amount of storage space. When a large number of gene sequences must be stored, copied, or transmitted, challenges arise regarding the efficiency of data storage and data transmission.
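The scale of the storage problem can be illustrated with a short calculation. The following is only a rough sketch: it takes the figure of about 6 billion characters from the passage above, and the two encodings compared (one byte per character, and a hypothetical 2-bit packing of the four base characters) are assumptions made for illustration, not representation modes described in the source.

```python
# Back-of-the-envelope storage estimate for a single human gene sequence.
CHARACTERS = 6_000_000_000  # about 6 billion characters, per the text above


def storage_bytes(num_chars: int, bits_per_char: int) -> int:
    """Bytes needed to store num_chars symbols at a fixed width, rounded up."""
    return (num_chars * bits_per_char + 7) // 8


as_text = storage_bytes(CHARACTERS, 8)  # assumption: one byte per character
packed = storage_bytes(CHARACTERS, 2)   # assumption: 2 bits distinguish A, G, T, C

print(f"plain text:   {as_text / 1e9:.1f} GB")
print(f"2-bit packed: {packed / 1e9:.1f} GB")
```

Even under the tighter packing, each stored sequence remains on the order of gigabytes, which is why storing only the differences from a reference becomes attractive when many sequences are kept.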
Biologists have found similarities between the gene sequences of different samples of the same species. For example, the similarity between human gene sequences is much higher than the similarity between the gene sequences of humans and other species. Likewise, the similarity between gene sequences within one race is usually higher than the similarity between gene sequences of different races. Based on this similarity, the concept of a reference gene sequence has been proposed: a representative, typical gene sequence that has been obtained during past data processing.
For example, in human beings, the gene sequences of males of a particular race might share common parts controlling skin color, hair color, and gender, and these parts might be identical or differ only slightly. The gene sequence of a given male of that race can therefore be used as the reference gene sequence. When the gene sequence of another male of the same race needs to be stored, it can be compared with the reference gene sequence, and only the difference data between the two gene sequences and an identifier of the reference gene sequence have to be stored. The amount of data to be stored is thereby greatly reduced, and the objective of data compression is achieved.
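The idea of storing only difference data together with a reference identifier can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the function names `encode` and `decode` and the identifier `"ref-001"` are hypothetical, and Python's general-purpose `difflib.SequenceMatcher` stands in for a real sequence aligner, since it would be far too slow for sequences of realistic length.

```python
from difflib import SequenceMatcher


def encode(reference: str, target: str, reference_id: str):
    """Return (reference_id, edits); each edit replaces reference[start:end]."""
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    edits = [
        (i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # identical stretches are not stored at all
    ]
    return reference_id, edits


def decode(reference: str, edits) -> str:
    """Rebuild the target sequence from the reference and the stored edits."""
    parts = []
    pos = 0
    for start, end, replacement in edits:
        parts.append(reference[pos:start])  # unchanged stretch of the reference
        parts.append(replacement)           # differing stretch of the target
        pos = end
    parts.append(reference[pos:])
    return "".join(parts)


# Toy sequences, far shorter than real gene sequences.
reference = "ACGTACGTACGTACGT"
target = "ACGTACCTACGTAGT"
ref_id, edits = encode(reference, target, "ref-001")
restored = decode(reference, edits)
```

Only `ref_id` and `edits` need to be stored; the full sequence is recovered by applying the edits to the referenced sequence. The fewer the differences, the smaller the stored record.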
Because many parts of the gene sequences of males of the same race are identical and the proportion of difference data is not very high, the above method can significantly reduce the storage space required for gene sequences. Accordingly, a large number of reference gene sequences can be stored in a reference data repository, and the reference gene sequence that best matches a to-be-stored gene sequence can be selected from the repository by similarity search. However, because each gene sequence involves a large amount of data and many possible combinations of individual base characters, existing similarity search algorithms are not well suited to gene sequences.
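Selecting a best-matching reference from a repository can be illustrated with a deliberately simple similarity measure. The sketch below assumes Jaccard similarity over sets of fixed-length substrings (k-mers); this measure is chosen purely for illustration and is not the similarity method the source goes on to propose. The names `kmers`, `jaccard`, and `best_reference`, and the repository contents, are hypothetical.

```python
def kmers(sequence: str, k: int = 4) -> set:
    """Set of all length-k substrings of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_reference(target: str, repository: dict, k: int = 4) -> str:
    """Identifier of the repository sequence most similar to the target."""
    target_kmers = kmers(target, k)
    return max(
        repository,
        key=lambda ref_id: jaccard(target_kmers, kmers(repository[ref_id], k)),
    )


# Toy repository mapping reference identifiers to sequences.
repository = {
    "ref_a": "ACGTACGTACGT",
    "ref_b": "TTTTGGGGCCCC",
}
chosen = best_reference("ACGTACGAACGT", repository)
```

A brute-force scan like this compares the target against every stored reference and holds whole k-mer sets in memory, which hints at why general-purpose similarity search scales poorly to sequences with billions of characters.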
Determining the similarity between two gene sequences is the basis for selecting a reference gene sequence and for other subsequent operations in the technical field of gene sequence processing. How to provide a more effective method for determining similarity based on the features of gene sequences has therefore become a research focus in this field.