1. Field of the Invention
The present invention relates to an integrated database system that integrates databases and arbitrary data sources, such as database servers and web servers connected via networks, to perform query processing. More particularly, the present invention relates to a system that optimizes a query issued to an integrated database.
2. Description of the Related Art
The Human Genome Project, which was initiated in 1990 in an effort to obtain all human DNA sequence data in about 15 years, published in June 2000 draft sequences accounting for about 85% of the human genome. This was made possible by the accelerated sequencing speed resulting from rapid advances in sequencing technology and from the application of a large-scale rearrangement method (shotgun cloning) using parallel computers. In parallel with the Human Genome Project, other projects are under way to map the DNA sequence data of various biological species other than humans. Through these projects, decoded data on amino acid sequences, on the three-dimensional structures of proteins in living organisms, and on metabolic pathways have been accumulated in an orderly manner in individual databases (bio-information databases). In many cases these databases are managed by public organizations, and their contents can be accessed by issuing a query via the Internet.
A variety of analysis tools that extract information derived from the sequence data (for example, by estimating a gene coding area from a large volume of DNA sequence data, or by estimating the three-dimensional structure of a protein) are being proposed one after another. Some of these tools are publicly available on the Internet.
The methods for obtaining experimental data in laboratories have also undergone drastic changes. Techniques for retrieving a large amount of data with high throughput have been devised, including the DNA micro-array method, which is capable of measuring the abundance (or expression) of many genes in individual cells simultaneously. With these new developments in the measuring process, a huge volume of experimental results is being stored in laboratories. Under these circumstances, it will be important to utilize a wide range of databases and tools in combination in order to understand what roles the genes and proteins encoded in the sequence data play in living organisms and how they are related to each other, and to apply the findings to the fields of pharmaceutical manufacturing, medical care and food.
To understand complex biological phenomena, it is essential to perform query processing that combines these databases and tools and to analyze the retrieved data. This, however, is accompanied by the following difficulties.
    (1) The formats or structures of the stored data differ from one database to another, and the databases have no unified form of executable query. It is therefore difficult to issue a single query that uses several databases in combination.
    (2) Since the databases differ in query capability, in data coverage and in the level of detail of the stored data, it is difficult to decide which databases should be linked and queried.
With the advance of technology, the number of new databases and tools that can be used in combination is rapidly increasing. However, since they are individually maintained and made available for public access, the above two aspects are not taken into consideration, and it takes an enormous amount of time and labor to use the new databases in integration with the existing ones.
For the efficient analysis of bio-information, it is therefore important to build an integrated database system with which a query that uses, in combination, a plurality of databases and tools not directly related to one another can be issued easily and executed efficiently.
In order to perform efficient query processing in an integrated database system that integrates a plurality of databases, it is important to realize a query optimization mechanism that provides an integrated interface for cross-linking a plurality of external databases having different data formats, converts the issued query into an efficient query plan, and executes that query plan.
There are two conventional query optimization schemes for integrated database systems. The first scheme is the approach used in the wrapper mediator system disclosed in “Capability Based Mediation In TSIMMIS” in “ACM SIGMOD International Conference On Management of Data (SIGMOD '98)” (published by ACM Press), pp. 564-566, and in U.S. Pat. No. 5,588,150. The second scheme is the approach used in the multiagent system disclosed in “Multiagent Systems” in Chapter 12 of “Foundations of Intelligent Knowledge-Based Systems” (published by Academic Press) and in JP-A-11-85522.
In the wrapper mediator type integrated database system according to the first conventional scheme, each external database is provided with a program (called a wrapper) that transforms queries and data formats into ones acceptable to that database. The mediator combines the appropriate wrappers and provides a unified query interface to them, so that the user can access a plurality of databases through a single interface. Each wrapper declares the query class that it can accept and registers it with the mediator. When a part or the whole of a submitted query is included in the query class declared by a wrapper, the processing of that part of the query can be entrusted to the wrapper. The mediator determines whether or not to entrust the query processing to the wrapper based on, among other things, the estimated cost of the query processing on the wrapper side.
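By way of a non-limiting illustration, the registration and delegation behavior described above may be sketched as follows. The class names `Wrapper` and `Mediator`, the representation of a query as a dictionary of field conditions, and the cost model (a unit cost multiplied by the number of records) are all assumptions made for this sketch; they are not taken from the cited references.

```python
from dataclasses import dataclass, field


@dataclass
class Wrapper:
    """Hypothetical wrapper around one external database."""
    name: str
    supported_fields: frozenset   # declared query class: fields it can filter on
    records: list                 # stand-in for the external database contents
    cost_per_record: float = 1.0  # assumed unit cost used for the estimate

    def accepts(self, query: dict) -> bool:
        # The query is in the declared class if every condition
        # refers to a field this wrapper supports.
        return set(query) <= self.supported_fields

    def estimate_cost(self, query: dict) -> float:
        # Assumed cost model: a full scan of the wrapped records.
        return self.cost_per_record * len(self.records)

    def execute(self, query: dict) -> list:
        return [r for r in self.records
                if all(r.get(k) == v for k, v in query.items())]


@dataclass
class Mediator:
    """Hypothetical mediator offering a single query interface."""
    wrappers: list = field(default_factory=list)

    def register(self, wrapper: Wrapper) -> None:
        self.wrappers.append(wrapper)

    def run(self, query: dict) -> list:
        candidates = [w for w in self.wrappers if w.accepts(query)]
        if not candidates:
            raise ValueError("no registered wrapper accepts this query")
        # Exactly one wrapper is chosen: the one with the lowest estimate.
        chosen = min(candidates, key=lambda w: w.estimate_cost(query))
        return chosen.execute(query)
```

In this sketch only the chosen wrapper is executed, so records held solely by the other candidate wrappers do not appear in the result.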
Generally, many alternative query plans can be considered for processing a query submitted to the integrated database system by using the external databases, differing in the combination of external databases used and in the order of the queries. These query plans differ in execution cost and in the data contents obtained as the query results. In the first conventional scheme, only one of these query plans is selected and executed, so the query results obtained may be fewer than those that could originally be obtained using the external databases.
For example, when attempting to collect the set of all genes contained in the human genome using the databases currently made public, the following query methods (1) to (3) can be conceived, and the contents of their query results differ greatly.
    (1) Selecting the genes that are clearly identified as human genes from the gene data registered in the gene database;
    (2) Extracting genes by applying a gene estimation tool to the human genome data registered in the sequence database; and
    (3) Finding the passages concerning the desired genes in the documents registered in the document databases, and determining the target data based on the gene names referred to in those passages.
Therefore, the query optimization scheme according to the first conventional scheme is not appropriate as a query optimization scheme for integrated databases in the field of bio-information, where many databases overlap one another in terms of the stored data and the query capability, and where there are many different ways of combining the databases.
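The effect of executing only one of several alternative query plans can be illustrated with a small sketch. The result sets, gene identifiers and cost figures below are placeholders invented purely for illustration, and the function names are hypothetical; the sketch merely compares the results of the single cheapest plan with the union of the results of all plans.

```python
# Hypothetical result sets for the three query methods described above.
# The gene identifiers and costs are placeholders, not real data.
PLAN_RESULTS = {
    "gene_database_selection": {"geneA", "geneB", "geneC"},
    "gene_estimation_on_sequences": {"geneB", "geneC", "geneD", "geneE"},
    "document_search": {"geneA", "geneE", "geneF"},
}
PLAN_COSTS = {
    "gene_database_selection": 1.0,
    "gene_estimation_on_sequences": 20.0,
    "document_search": 8.0,
}


def select_single_plan(costs):
    """Pick the one plan with the lowest estimated cost."""
    return min(costs, key=costs.get)


def obtainable_results(results):
    """Union of the results of every alternative plan."""
    return set().union(*results.values())


def coverage(plan, results):
    """Fraction of the obtainable results returned by a single plan."""
    return len(results[plan]) / len(obtainable_results(results))
```

With these placeholder values, `select_single_plan(PLAN_COSTS)` picks `"gene_database_selection"`, which returns three of the six obtainable genes.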
The multi-agent type integrated database system according to the second conventional scheme comprises external agents, each of which encapsulates an individual data source and the query capabilities for that data source, and a coordinate agent that accepts a submitted query and forwards it to the external agents. Each external agent registers in advance with the coordinate agent the query class that it can handle. For a query issued by the user, the coordinate agent entrusts the query processing to an appropriate agent that can handle the query, according to the contents registered by the external agents. Here, the coordinate agent may, as required, transform the query and the data format so that the associated external agent can process the query.
As described above, the second conventional scheme differs from the first conventional scheme in that no single interface is provided to the user. Nevertheless, a query can be issued relatively easily because the coordinate agent conceals the differences in data format and in query capability among the individual data sources.
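The registration and forwarding performed in the second conventional scheme may be sketched, again by way of a non-limiting illustration, as follows. The class names, the `to_native` conversion hook, and the rule of forwarding to the first registered agent are assumptions made for this sketch and are not taken from the cited references.

```python
class ExternalAgent:
    """Hypothetical agent encapsulating one data source and its query capability."""

    def __init__(self, name, query_classes, data, to_native=str):
        self.name = name
        self.query_classes = set(query_classes)  # classes this agent can handle
        self.data = data                         # native key -> list of results
        self.to_native = to_native               # common-format term -> native key

    def handle(self, native_term):
        return self.data.get(native_term, [])


class CoordinateAgent:
    """Hypothetical coordinate agent that forwards queries to external agents."""

    def __init__(self):
        self.registry = {}  # query class -> agents, in registration order

    def register(self, agent):
        for query_class in agent.query_classes:
            self.registry.setdefault(query_class, []).append(agent)

    def dispatch(self, query_class, term):
        agents = self.registry.get(query_class, [])
        if not agents:
            raise LookupError(f"no agent registered for {query_class!r}")
        # Assumption of this sketch: the first registered agent is chosen.
        agent = agents[0]
        # The query is transformed into the agent's native format before forwarding.
        return agent.name, agent.handle(agent.to_native(term))
```

For example, an agent registered with `to_native=str.upper` receives the term `"gene1"` as `"GENE1"`, so the user need not know the native key format of that data source.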
However, in the second conventional scheme as well, a single combination of destination agents is determined for the submitted query according to the inclusion relation among the query processing capabilities of the agents. Hence, as in the first conventional scheme, the query results obtained may be fewer than those that could originally be obtained using the available external databases. For this reason, this scheme is also not appropriate as a query optimization scheme for integrated databases in the bio-information field.
In the conventional schemes described above, the submitted query is executed by selecting one of several query plans, each of which combines and executes query processing in the external databases. Therefore, the query results obtained by executing the selected query plan may be fewer than those that could originally be obtained using the external databases.