1. Field of the Invention
The present invention relates to a list process system and a method for processing a plurality of lists, each of which is composed of a plurality of data, and for extracting features thereof.
2. Description of the Related Art
Recent advancements in biotechnologies strongly influence our society, day by day. In particular, genetic engineering that involves gene recombinations of deoxyribonucleic acid (DNA), of which genes are composed, and protein engineering that synthesizes new proteins from existing proteins, have advanced remarkably.
DNA is a high molecular compound made up of repeating nucleotide units, each composed of a base, a sugar (deoxyribose), and phosphoric acid. The bases that compose DNA fall into four types, namely: adenine (A), thymine (T), cytosine (C), and guanine (G). The deoxyribose and phosphoric acid of successive nucleotides are linked to form a strand, and two such strands form a double-helix structure.
The bases of the two strands pair according to fixed rules. Specifically, A pairs with T, and C pairs with G. The sequence of these bases in DNA determines the type thereof (namely, the type of gene).
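The pairing rules above can be illustrated with a minimal sketch. The function name `complement` and the table name `PAIRS` are assumptions introduced only for illustration; they do not appear in the original description.

```python
# Pairing rules as stated in the text: A pairs with T, C pairs with G.
# PAIRS and complement() are hypothetical names used for illustration.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of each base in a DNA strand."""
    # A KeyError is raised for any character that is not A, T, C, or G.
    return "".join(PAIRS[base] for base in strand.upper())


if __name__ == "__main__":
    print(complement("ATCG"))
```

Given one strand, the sketch produces the bases that would pair with it on the opposite strand, position by position.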
Since the sequence of bases of DNA records genetic information, genetic engineering requires a technique for accurately and quickly decoding a given complex sequence of bases of DNA (a gene sequence).
A protein is a high molecular compound in which many different amino acids are concatenated in a chain by peptide linkages. A polypeptide composed only of amino acids is called a simple protein. A compound of amino acids with nucleic acids, carbohydrates, phosphoric acid, and so forth is called a conjugated protein. The various functions of proteins depend on the sequence of amino acids that form the polypeptide linkages, the geometric disposition of the polypeptide chain, and the like. Thus, protein engineering requires a technique for accurately and quickly analyzing the sequence of amino acids of a given protein.
To determine the characteristics of a given gene or a given protein, its sequence is compared with all available sequences stored in databases, so that sequences homologous to the given sequence are obtained. In methods for searching sequence databases, similar regions are searched for from the beginning of the two sequences to be compared. In addition, the similarity of each region is calculated so as to evaluate the overall similarity of the sequences.
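As a rough illustration of scoring regions and then combining them into an overall similarity, the following sketch compares two sequences window by window. It is a deliberate simplification and not any particular search method: it assumes fixed-size, non-overlapping windows compared position by position without gaps, and the names `region_identity`, `overall_similarity`, and `window` are hypothetical.

```python
def region_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length regions match."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def overall_similarity(query: str, target: str, window: int = 4) -> float:
    """Average the per-region identity over aligned, non-overlapping windows.

    Assumption: both sequences are compared from their beginning with no
    gaps; any trailing partial window is ignored.
    """
    length = min(len(query), len(target))
    scores = [
        region_identity(query[i:i + window], target[i:i + window])
        for i in range(0, length - window + 1, window)
    ]
    # With no complete window there is nothing to score.
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(overall_similarity("ACGTACGT", "ACGTACGA"))
```

Changing the `window` parameter alone can change the score for the same pair of sequences, which hints at why results depend on the chosen method and parameters.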
However, no single method for sequence database search has been established. At the present time, various techniques are used in combination and their results are compared. In addition, there are many sequence databases that can be used for the search. Thus, the results of a search for the same sequence data vary depending on the method, database, parameters, and so forth that are used.
Conventionally, the search is repeatedly performed with a combination of several methods, parameters, and so forth. By comparing the results, retrieved sequences are discarded or selected so that the most suitable searching method and parameters are obtained.
The results of the search are individually output in a list format and the similarities and differences thereof are manually determined.
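The comparison of result lists described above, which is conventionally done by hand, can be sketched with set operations. This is only an illustration under assumed inputs: the function name `compare_result_lists`, the method labels, and the sequence identifiers are hypothetical and not taken from the original description.

```python
def compare_result_lists(results: dict) -> dict:
    """Summarize what several search result lists share and where they differ.

    `results` maps a label for each search (method, database, or parameter
    set) to the list of sequence identifiers that search returned.
    """
    sets = {name: set(ids) for name, ids in results.items()}
    # Identifiers returned by every search.
    common = set.intersection(*sets.values()) if sets else set()
    summary = {"common": common}
    for name, ids in sets.items():
        # Union of everything returned by the other searches.
        others = set().union(*(s for n, s in sets.items() if n != name))
        summary["only_" + name] = ids - others
    return summary


if __name__ == "__main__":
    # Hypothetical result lists from two searches of the same query.
    example = {
        "method_a": ["seq1", "seq2", "seq3"],
        "method_b": ["seq2", "seq3", "seq4"],
    }
    for key, ids in compare_result_lists(example).items():
        print(key, sorted(ids))
```

With only a few short lists this comparison is easy to do manually; the sketch shows the same similarity and difference determination expressed mechanically.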
In the conventional searching method, when the results of the search are few in number, they can be manually processed. However, as the number of results increases, the processing time becomes long and errors increase.
Devices that automatically read sequences of genes and amino acids have come into wide use. In addition, as a result of large-scale projects for decoding gene information, such as the Human Genome Project, the amount of sequence data has increased remarkably. Thus, the amount of data produced by individual searches has reached a level that cannot be processed manually.