Database management systems (DBMS), particularly relational DBMSs, are widely used. Conventional local DBMSs, such as DB2, utilize local data sources. Such conventional local DBMSs generally include conventional query optimizers used to generate a query execution plan for a particular query. The query execution plan determines how the particular query will be executed by the conventional local DBMS. In order to generate the query execution plan, the conventional query optimizer generates a number of prospective query execution plans and costs each of the prospective query execution plans. The cost of a particular query execution plan is effectively the time to execute the query using the query execution plan. The conventional query optimizer selects the prospective query execution plan with the lowest cost as the query execution plan for the query.
To determine the cost of a prospective query execution plan, conventional query optimizers use the cardinality for the prospective query execution plan. The cardinality is generally a major factor in determining the time, or efficiency, for executing the query. The cardinality is the number of rows processed at each intermediate step of a query execution plan. In order to determine the cardinality for the prospective query execution plan, the conventional query optimizer utilizes statistics for the conventional relational DBMS. The statistics might include the number of rows in a table, the number of distinct values for a column, histograms of the distribution of data values in a column, the number of distinct index keys, and the most frequent values in a column. Advanced conventional query optimizers may also use joint statistics on groups of columns in order to deal with possible correlations between column values. In addition, many query optimizers also utilize statistics for other parameters in determining the cost.
Although conventional query optimizers can formulate query execution plans, one of ordinary skill in the art will recognize that erroneous database statistics can cause the conventional query optimizer to improperly estimate the cardinalities. Consequently, conventional query optimizers may incorrectly determine the cost of a prospective query execution plan. This erroneous determination may result in a poor choice of query execution plan and, therefore, unacceptably long processing times for queries. Various conventional mechanisms exist for accounting for changes in statistics and improving selection of a query execution plan. However, such conventional mechanisms may be restricted to local databases.
Conventional federated Database Management Systems are conventional DBMSs that are able to interface with independent, external data sources and provide a relational view over remote data. Such external data sources might include independent instances of the same (local) database, third party relational databases and also non-relational data sources like spreadsheets and flat files. An example of such a federated DBMS includes the WebSphere Information Integrator. In a conventional federated DBMS, a query execution plan is still developed using a conventional query optimizer configured for a federated DBMS. In order to generate a query execution plan, the conventional query optimizer formulates prospective query execution plans, costs the query execution plans, and selects the query execution plan having the lowest cost, in a similar manner as for a conventional local DBMS. However, for a conventional federated DBMS, the conventional query optimizer also determines those portions of the query execution plan that will be executed by the external data sources. In particular, the conventional query optimizer considers both the cost of executing portions of the query at each external data source as well as the additional costs of the federated overhead. The conventional query optimizer determines the cost of executing portions of the query at a particular external source using the statistics on the remote data to estimate the cardinalities of the results that will come back from the external data source. Thus, whether a portion of the query is to be executed on the remote data source depend on the cost of executing the portion of the query on the external data source versus the cost of executing the portion of the query locally.
Once the portions of the query to be executed locally and remotely, by the external data source(s), are determined, SQL statements for the portions of the query being executed remotely are generated for the appropriate external data sources. The statements are executed and the resultants returned to the conventional federated DBMS. The portions of the query to be locally executed are also performed. Thus, the query can be executed by the conventional federated DBMS.
Although a conventional federated DBMS can execute queries using query execution plans, one of ordinary skill in the art will recognize that there are barriers to efficient execution of queries in a conventional federated DBMS. The usage, communication and synchronization of statistics between the federated server of the conventional federated DBMS and the remotely accessed data sources as well as incompatibilities in the statistical models used by the federated server and the external data sources may compromise selection of an efficient query execution plan. In particular, these issues may adversely affect the accuracy of the statistics for the external data source used by the conventional query optimizer in generating the query execution plan. This is true even though the conventional federated DBMS has a local mechanism for utilizing statistics to update costing of the prospective query execution plans.
Accordingly, what is needed is a method and system for improving the efficiency of query execution in a federated DBMS. The present invention addresses such a need.