The present disclosure relates to access path optimization in a database, and more particularly, to a method and apparatus for estimating a filter factor used for access path optimization in a database.
During the process of accessing a database and performing searching, in order to conduct a query, there is need for the database to establish an access plan. The access plan defines in what order tables are accessed, which indexes are used, and what joining method is used to link data. An access plan may be important for structured query language (SQL) statements to be rapidly executed. Most relational database management systems (RDMS) create access plans using a cost based optimizer. The cost based optimizer estimates the costs of each candidate access path according to the statistics information in the database, and selects the access path with the minimum cost as the optimal path. The above mentioned cost mainly comprises central processing unit (CPU) execution cost, input/output (I/O) cost, as well as other factors.
When calculating the costs of each candidate access path, the optimizer performs the estimation mainly according to the related statistics information of tables and indexes as well as filter factors of a query statement. A filter factor means, when “filtering” with a specified condition in a query statement, the ratio of data entries selected from a specified range of a database to the total data entries of the specified range. The filter factor may be important for the optimizer to create an effective access plan. In one example, assume that a database comprises a Sales Information Table T for a computer accessory consisting of 100 rows, in which 95 rows relate to Model A and 5 rows relate to Model B, and the model information is recorded in the 2nd column. If a query is conducted using the query statement SELECT*FROM T WHERE COLUMN—2=‘A’, that is, filtering by using Model=A as the predicate, 95 data entries will be returned and the filter factor will be 95%. For such a filter condition and filter factor, the optimizer of the Database Management System can conclude from simple calculation that, it is an economic access path to directly scan and judge each row of Table T, and read the data entries where the 2nd column is A. In another case, if filtering using Model=B as the predicate, 5 data entries will be returned and the filter factor will be 5%. In this case, the optimizer can conclude from calculation that, it is a preferable access path to first read the index of the table, and then directly access the data entries where the 2nd column is B according to the pointer recorded in the index. Thus, it can be seen that the filter factor plays a relatively important role in the process of the optimizer creating an access plan.
In many cases, however, the filter factor cannot be obtained simply and directly like the above discussed example. For example, in the case of a typical embedded SQL, the SQL statement is embedded into a host application, and cites in the query statement the host variable(s) or parameter(s) of the host application. Hence, the precise selection of the filter condition and the value of the filter factor are directly dependent on the value of the host variable(s). However, as the variables in the host application, the values of the host variables can be determined only when the application is executed. Since there is no way to precisely determine the values of these variables in advance, the optimizer has to estimate a possible filter factor based on the assumption that the values of the host variables are evenly distributed, and perform the access path optimization based on the estimated filter factor. It could be understood that if the values of the host variables generated when the host application is executed deviate from the assumption of even distribution, the estimated filter factor will not be precise enough, thus obtaining an access plan with low efficiency.
In order to solve the problem, in some solutions, during the process of executing the host application, the access path will be optimized once again by using the actually generated values of the host variables. However, the cost caused by the re-optimization in the process of execution may sometimes be unacceptable, or may affect the execution performance of the application. Therefore, in most cases, the access path for an SQL statement is still created statically before the runtime, instead of being optimized dynamically.