A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates generally to information processing environments and, more particularly, to improved system performance during retrieval of information stored in a data processing system, such as a Relational Database Management System (RDBMS).
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as xe2x80x9crecordsxe2x80x9d having xe2x80x9cfieldsxe2x80x9d of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of a database management system is known in the art. See e.g., Date, C., An Introduction to Database Systems, Volume I and II, Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
RDBMS systems have long since moved from a centralized mainframe environment to a de-centralized or distributed environment. One or more PC xe2x80x9cclientxe2x80x9d systems, for instance, may be connected via a network to one or more server-based database systems (SQL database server). Commercial examples of these xe2x80x9cclient/serverxe2x80x9d systems include Powersoft(trademark) clients connected to one or more Sybase(copyright)) Adaptive Server(trademark) (formerly Sybase(copyright) SQL Server(trademark)) database servers. Both Powersoft(trademark) and Sybase(copyright) Adaptive Server(trademark) are available from Sybase, Inc. of Emeryville, Calif. As the migration to client/server systems continues, each day more and more businesses are run from mission-critical systems which store information on server-based SQL database systems, such as Sybase(copyright) Adaptive Server(trademark). As a result, increasingly higher demands are being placed on server-based SQL database systems to provide enterprise-wide decision supportxe2x80x94providing timely on-line access to critical business information (e.g., through xe2x80x9cqueriesxe2x80x9d).
At its core, every RDBMS system includes certain modules which perform basic tasks, including a parser, an optimizer, an execution engine, and a data manager. A parser reads client statements and transforms them into an internal representation. An optimizer takes the internal representation of the statement and looks at several alternative strategies for obtaining the correct response, an xe2x80x9canswerxe2x80x9d from the underlying database. The choices made by the optimizer have a profound impact on a system""s response time for the client. Improper choice, for instance, can delay response time by seconds, minutes, or hours. The job of the optimizer is, therefore, to make the best choice using estimations based on the xe2x80x9ccostxe2x80x9d of various strategies. The execution engine employs the execution strategy formulated by the optimizer to obtain the correct response and give the results to the client. During operation, the execution engine submits requests to the data manager to obtain information from tables. This is done in a manner that was determined by the optimizer, for instance, using available indices, performing table scans, or the like.
In today""s information-based economy, on-line database systems are critical for running the day-to-day operations of a business, whether for decision support or for on-line transaction processing. Accordingly, there has been great interest in the area of improving the speed by which these systems execute database queries. The underlying performance of a database system is closely tied to its optimizer, which, in turn, is closely tied to the cost estimates which the optimizer adopts. Consider, for instance, a cost estimate of an optimizer which inaccurately predicts that a particular operation requires only a few seconds, when in fact the operation takes minutes or hours. This type mistake is often magnified in the context of a complex query, where the particular operation might occur hundreds or thousands of times. The end result of the mistake is unacceptable system performance. If, on the other hand, the accuracy of the estimates of the cost of a particular strategy provided by the optimizer is improved, the predicted performance of the final execution plan will be more accurate. In this case, the result is better performance of the RDBMS system. The system exhibits better throughput and response time for queries, including DSS (Decision Support System) queries.
The cost estimates provided by optimizers in present-day RDBMS systems are not particularly accurate. This results in poor execution plan strategies being selected. Attempts to address the problem have focused on xe2x80x9cworkaroundsxe2x80x9d for poor optimizer plan selection. Here, systems allow a DBA (Database Administrator) to explicitly override the optimizer""s selection with a xe2x80x9cforce plan optionxe2x80x9d or xe2x80x9cforce index option.xe2x80x9d Such an approach entails significant disadvantages, however. Overriding an optimizer is a highly-skilled, labor-intensive task and, as a result, a very costly proposition for users of RDBMS systems.
This manual override approach exists in stark contrast to what automation users expect from modem RDBMS systems. One of the main advantages of RDBMS systems is that this type of work should be done automatically. The normal mode of operation is that the optimizer should automatically adjust execution plans given that the data distributions in the RDBMS system changes over time. If explicit overrides need to be specified, then this advantage of an RDBMS system is negated and the costly analysis may need to be repeated over and over again. Further, the option of manually overriding a system""s optimizer is often not available to users. A growing part of RDBMS business supports xe2x80x9cVARxe2x80x9d (Value-added Retailer) applications, including, for instance, those provided by Peoplesoft(trademark), Siebel(trademark), and Baan(trademark). In these cases, the RDBMS users (i.e., end-user customers) may not even have the ability to use xe2x80x9cforcexe2x80x9d options since only the VAR has the ability to change the application. At the same time, the VARs do not want to make RDBMS vendor specific changes to their application for problems in a particular RDBMS vendor""s optimizer. All told, there exists great interest in improving an optimizer""s plan selection without requiring users to provide explicit override options.
The present invention comprises a Client/Server Database System with improved methods for optimizing execution of database queries. In an exemplary embodiment, the system includes one or more Clients (e.g., Terminals or PCs) connected via a Network to a Server. The Server, operating under a server operating system (e.g., UNIX) includes a Database Server System, such as Sybase Adaptive Serverm. In general operation, Clients store data in and retrieve data from one or more database tables resident on the Server by submitting SQL commands, some of which specify xe2x80x9cqueriesxe2x80x9dxe2x80x94criteria for selecting particular records of a table.
For enhancing the speed in which the Database Server performs queries, the system provides methods employing spline histograms for improving the determination of selectivity estimates. The general approach improves histogram-based cost estimates as follows. The constant associated with a predicate (e.g., in r.a greater than 5, the constant is xe2x80x9c5xe2x80x9d) is used to do a binary search in an array of histogram boundary values, for determining a particular cell. Once a cell has been found, the system employs interpolation to find out how much of the cell has been selected. Once this interpolation value is found, it is used with a cell weighting and a spline value or weighting to estimate the selectivity of the predicate value. Here, the spline weighting takes into account how data values are distributed within the cell.
Internally, the system provides a xe2x80x9cspline estimatexe2x80x9d method, st_spline_estimate, which is invoked with three parameters: total, spline, and interpolate. The total parameter reflects the total weight of the selected histogram cell. The spline parameter indicates the positive or negative weighting of the spline. The interpolate parameter is the estimate of the portion of the cell selected, which is previously computed by an interpolate method.
The functionality of the xe2x80x9cspline estimatexe2x80x9d method is divided according to whether the spline is positive or negative. Selectivity is calculated in the case of a positive spline by adding the uniform component to the spline component. First, the uniform component of selectivity is determined by subtracting the spline value or weighting from the total cell weighting and then multiplying that quantity by the interpolate fraction, as follows.
selectivity=(totalxe2x88x92spline)*interpolate;
Now, the method adds to the selectivity value the spline component, which is calculated as the spline weighting multiplied by the interpolate fraction squared.
selectivity+=spline*interpolate*interpolate;
In a similar manner, the calculation for a negative spline also adds the uniform component to the spline component. However in that instance, the uniform component of selectivity is calculated by subtracting the spline weight from the total cell weight and multiplying that value by the interpolate fraction.
selectivity=(totalxe2x88x92spline)*interpolate;
To add the spline component, the method first calculates a new interpolate fraction by subtracting the previously-calculated interpolate fraction from the value of 1 as follows.
interpolate=1.0xe2x88x92interpolate;
Now, the selectivity calculation adds the spline component, which is determined by multiplying the spline value by the quantity of 1 minus the interpolate fraction squared, as follows.
selectivity +=spline*(1.0xe2x88x92interpolate*interpolate);
As a result of increased accuracy of estimates, the system can formulate better query plans and, thus, provide better performance.
A multi-attribute selectivity optimization methodology of the present invention is also described providing a more accurate estimate of the cost, so that the predicted performance of the final execution plan will be more accurate. In the single attribute case, histograms are used to capture selectivity of a single column but this cannot always accurately be combined to produce multi-attribute selectivities. In accordance with the present invention, the densities by how much the selectivity deviates from the single attribute density and by how much the multi-attribute densities differ from one another are used as a basis for multi-selectivity estimates. The multi-attribute densities are used to scale estimates between extremes of total independence and total dependence.
A method of the present invention for improving determination of cost estimates associated with data access occurring during execution of a database query may be summarized as follows. The system receives a database query specifying a database operation for at least one database object, where the database query specifies a query condition for selecting particular rows based on multiple attributes of at least one of the database objects. Now the method may proceed to determine an optimal execution path for executing the database query as follows. The method determines selectivity information providing a selectivity estimate for each of said multiple attributes; once this information has been determined, it may be stored persistently from one session to another, if desired. Next, the method determines correlation information that provides a measure of how well at least some of the multiple attributes are correlated; again, this may be stored in a persistent manner, if desired. Based on the determined correlation information, the method may selectively combine the selectivity information together for determining a multi-attribute selectivity estimate for the query condition. Now, this information may be employed to formulate a query execution plan for providing access to the database object(s), where the query execution plan includes an access strategy based on the determined multi-attribute selectivity estimate.