The present invention relates to an optimization of queries being performed on a database management system (DBMS).
The advent of powerful, yet economical computers made possible by advances in processor, memory and data storage devices has made computers an integral part of modern companies. An important class of applications for these computers includes a DBMS where information is collected and organized according to a data model and searched via queries. The DBMS allows users to perform operations such as locating, adding, deleting and updating records stored in the computer without a detailed knowledge of how the information making up the records actually is stored in the computer.
One powerful type of DBMS is known as a relational DBMS where stored information appears to the user as a set of tables, each of which is termed a relation. In each relation, the information is arranged in rows and columns, with columns of data being related to each other by one or more predetermined functions.
To access particular information in the relational DBMS, a query compiler converts a user request typically expressed in a Structured Query Language (SQL) into a set of operations to be performed on one or more input relations to yield a solution responsive to the user's request.
As the volume of data stored in the DBMS grows, an efficient processing of user queries becomes important. The efficient processing of queries also is important when the DBMS is an object relational database that contains new data types other than traditional types such as text, integer, decimal, floating point and date types. For example, a product called Informix.RTM. Universal Server, available from Informix Corporation of Menlo Park, Calif., supports user-definable data types, such as graphics, video, animation, sound, 3D geometry and geospatial data types, among others. These new data types in turn require access methods and functions that operate on the new data types. These access methods and functions have introduced added difficulties for conventional query optimizers.
To minimize processing time and utilize resources more efficiently, a query optimizer typically is deployed to minimize disk input/output (I/O), memory, and processor loading, thus providing a lower expected processing cost for carrying out operations associated with the query. The query plan generated by the optimizer for accessing a particular row from a table which contains a user-defined data type may involve use of existing access methods such as sequential table scan and B-tree index scans or new access methods defined within the user-defined data type. If the query involves only a single table, the optimizer determines if the entire table is to be scanned or if any existing indices on the table are to be used. If the query involves multiple tables, the optimizer determines 1) the order in which the tables are accessed, 2) the method of access (table scanning or indexing), and 3) the order and method of joining tables.
In order for the optimizer to come up with good query plans accurate cost estimates are needed. Cost estimates typically are expressed as a combination of expected number of disk I/Os, number of rows that must be processed and the cost of evaluating the predicates for each row. If the estimates are inaccurate, the optimizer may produce inefficient query execution plans, resulting in poor search performance for these new data types.
An important metric in the optimization process, known as predicate selectivity, defined as the ratio of the number of rows in the result set to the total rows in the table. In evaluating alternate query execution strategies, the optimizer calls a selectivity estimation routine to access statistics stored in system tables relating to selectivity. Since selectivity is a function of the characteristics of the data, data type-specific statistics are periodically gathered and used for computing selectivity. For standard data types, statistics gathered typically include number of distinct values, minimum and maximum values, and the distribution of data values, expressed as a histogram. However, even for conventional data types, it is usually difficult to capture information for all relevant functions defined for the data type. For more complicated types, such as user-defined data types, the task of estimating selectivity could be cost prohibitive. If several operators and user-defined routines (UDRs) can operate on a given column of data, the optimizer needs selectivity estimates for each operator and UDR. The accuracy of selectivity estimates can have a significant impact on the query plan generated by the optimizer and in the performance of the query.
In addition to selectivity, the optimizer also considers the cost associated with different access methods, join strategies and predicate evaluation in coming up with a least expensive query execution strategy. Thus, if the new data type introduces new access methods, then associated cost formulae are needed by the optimizer. The cost per call value of functions associated with user-defined data types is usually fixed and can be measured during the initial testing period of each user-defined data type.
Because the selectivity and cost per call information for each user-defined data type depend on factors such as the distribution of values of the underlying column and values specified in the queries, deriving the selectivity function or the cost per call function during the design, implementation or testing of the user-defined data type is non-trivial. Hence, users may neglect to provide the functions for generating per call cost and selectivity estimates, which in turn causes the optimizer to use arbitrarily determined default estimates such as zero or one. The use of default estimates inaccurately captures dynamics in the DBMS, leading to inefficient query strategies.