During the process of accessing a database and searching it, in order to conduct a query, there is need for the database to establish an access plan. The access plan defines in what order tables are accessed, which indexes are used, and what joining method is used to link data. A good access plan is very important for Structured Query Language (SQL) statements to be rapidly executed. Most Relational Database Management Systems (RDBM) create access plans using cost based optimizer. The cost based optimizer estimates the costs of each candidate access path according to the statistics information in the database, and selects the access path with the minimum cost as the optimal path. The above mentioned cost mainly comprises Central Processing Unit (CPU) execution cost, Input/Output (IO) cost, and so on.
When calculating the costs of each candidate access path, the optimizer performs the estimation mainly according to the statistics information related to tables and indexes in the database as well as filter factors of a query statement. A filter factor is, when “filtering” with a specified condition in a query statement, the ratio of data entries selected from a specified range of a database to the total data entries of the specified range. The filter factor is very important for the optimizer to create an effective access plan.
In a very simple example, suppose that a database comprises a Sales Information Table T for a computer accessory consisting of 100 rows, in which 95 rows relate to Model A and 5 rows relate to Model B, and the model information is recorded in the 2nd column. If a query is conducted using the query statement SELECT*FROM T WHERE COLUMN—2=‘A’, that is, filtering by using Model=A as the predicate, 95 data entries will be returned and the filter factor will be 95%. For such a filter condition and filter factor, the optimizer of the Database Management System can conclude from simple calculation that it is an economic access path to directly scan and judge each row of Table T, and read the data entries where the 2nd column is A. In another case, if filtering is performed using Model=B as the predicate, 5 data entries will be returned and the filter factor will be 5%. In this case, the optimizer can conclude from calculation that it is a preferable access path to first read the index of the table, and then directly access the data entries where the 2nd column is B according to the pointer recorded in the index. Thus, it can be seen that the filter factor plays a very important role in the process of the optimizer creating an access plan. The optimizer usually creates different access paths according to different filter factors. Furthermore, the value of the filter factor is closely dependent on the distribution of data in tables and the literal value of the filter condition designated in a query statement, i.e. ‘A’ or ‘B’ in the above example.
In many cases, however, the filter factor cannot be obtained simply and directly like the above examples. For example, in the case of typical embedded SQL, the SQL statement is embedded into a host application, and cites in the query statement the host variable(s) or parameter(s) of the host application, for example, SELECT*FROM T WHERE COLUMN—2=var1, where var1 is a variable. Hence, the literal value in the filter condition and the value of the filter factor are directly dependent on the value of the host variable(s). However, generally, the values of the host variables cannot be determined until the application is executed. In addition, in some cases, the application executes the same query statement repeatedly for many times, but each time it executes the query statement, the value of the variable generated may be different. As there is no way to precisely determine the literal value of the filter condition before the query statement is executed, many methods have been proposed to estimate filter factor in order to optimize the access path.
In an access path optimizing solution, the optimizer estimates a possible filter factor based on the assumption that the values of the variables are evenly distributed, and creates an economic access path based on the estimated filter factor. After determining the access path for the query statement, each time the query statement is executed, database is accessed according to the pre-established access path. Still referring to the above example of Sales Information Table T for a computer accessory, for the query statement SELECT*FROM T WHERE COLUMN—2=var1, since the 2nd column contains two possible values A and B, the optimizer may averagely deem that the filter factor of the query statement is 50%, and create a constant access path based on the filter factor. It can be seen that, in this optimizing solution, the access path is predetermined, and thus it has the advantages of stability and predictability. However, in the case that the distribution of data in tables deviates from the assumption of even distribution, the filter factor estimated by the solution will not be precise enough, thus obtaining an access plan with low efficiency.
In another optimizing solution, the optimizer supposes that the first literal value generated when the query statement is executed is more representative, and thus the estimated filter factor is closer to the actual situation compared with the supposition of even distribution. Accordingly, the optimizer creates access path based on thus estimated filter factor, and caches the created access path for use in subsequent execution of the query statement. Hence, each time the query statement is executed, the database is accessed according to the access path in the cache. Referring again to the above example of Sales Information Table T and the query statement SELECT*FROM T WHERE COLUMN—2=var1, if var1 is given the value of B when the query statement is executed for the first time, the optimizer will create access path based on the filter factor 5%. In this optimizing solution, it can be understood that, if the literal value B generated for the first time is really representative enough, or in other words, the occurrence frequency of the literal value B is large enough in the subsequent query execution, the created access path will be effective for the subsequent execution of the query statement. However, if the literal value generated for the first time happens not to be the most frequently used value, for example, all the var1 values are A during subsequent execution, the access path created by the optimizer will cause the query statement to be subsequently executed with very low efficiency. Therefore, the execution performance of this optimizing solution fluctuates significantly, and is not stable enough.
In another optimizing solution, the optimizer performs dynamic optimization during the execution of query statement. Each time the query statement is executed, the optimizer estimates filter factor according to the actually generated literal values, and then creates an economic access path. This optimizing solution is effective to find out the optimal access path. However, as it has to re-estimate the filter factor and re-create the access plan every time, the cost of executing the optimizing solution may even possibly exceed the cost of executing the query, and the generated execution expense is often beyond a tolerable range. In addition, constantly generating or changing the access path would make it difficult to monitor and trace the access path.
In yet another optimizing solution, the optimizer re-estimates filter factor each time the query statement is executed, and judges whether the previously created access path is suitable for the newly obtained filter factor. If the filter factor is not compatible with the cached access path, a new access path will be created. Although this solution reduces some expenses compared with the solution of re-optimizing every time, it still has almost all the disadvantages. As the optimizer still needs to re-estimate filter factor each time the query statement is executed and re-optimize the access path if necessary, the cost is still high.
Therefore, a better optimizing solution is desired to improve the optimization performance of the database optimizer.