A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates generally to information processing environments and, more particularly, to retrieval of information stored in a data processing system, such as a Database Management System (DBMS).
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as xe2x80x9crecordsxe2x80x9d having xe2x80x9cfieldsxe2x80x9d of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of a database management system is known in the art. See e.g., Date, C., An Introduction to Database Systems, Volume I and II, Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
DBMS systems have long since moved from a centralized mainframe environment to a de-centralized or distributed environment. One or more PC xe2x80x9cclientxe2x80x9d systems, for instance, may be connected via a network to one or more server-based database systems (SQL database server). Commercial examples of these xe2x80x9cclient/serverxe2x80x9d systems include Powersoft(trademark) clients connected to one or more Sybase Adaptive Server(trademark) database servers. Both Powersoft(trademark) and Sybase Adaptive Server(trademark) are available from Sybase, Inc. of Emeryville, Calif.
As the migration to client/server continues, each day more and more businesses are run from mission-critical systems which store information on server-based SQL database systems, such as Sybase Adaptive Server(trademark). As a result, increasingly higher demands are being placed on server-based SQL database systems to provide enterprise-wide decision support. Accordingly, there is much interest in improving the performance of such system, particularly in terms of execution speed and reliability.
In current RDBMS deployments, the usage of decision support queries is becoming increasingly significant. One of the key components of these deployments is very good performance via superior optimization and data retrieval techniques. A very popular technique used to boost performance is the concept of using large buffer sizes to fetch larger chunks of data, thus reducing the number of I/O activity. This technique is commonly referred to as xe2x80x9clarge block I/Osxe2x80x9d. However, the performance gains from large block I/Os is limited to the quality of the underlying query planxe2x80x94that is, a query plan with optimal large block I/O strategy will provide the relevant benefits whereas a query plan with sub-optimal large block I/O strategy could lead to performance degradation. This decision, which is typically made by the xe2x80x9cquery optimizerxe2x80x9d module of a database system, is usually based on available statistical measures of the data. Therefore, the accuracy and smart usage of the statistical measures are critical to the generation of optimal query plans.
The present invention comprises a database system with improved query execution performance. In order to achieve as much accuracy as possible by optimizing large block I/O related query plans, the database system implements a methodology employing two new statistical measures for use in a database server. The first statistical measure, called xe2x80x9cData Page Cluster Ratioxe2x80x9d (DPCR), indicates the xe2x80x9cpackingxe2x80x9d of data pages by measuring the xe2x80x9cclusteringxe2x80x9d property of the data pages, i.e., how well clustered the consecutive data pages of a given relational table are in a fixed group of pages called xe2x80x9cextents.xe2x80x9d The ratio is measured for data page accesses either via a scan of an index or via a scan of the table. The second statistical measure, called Index Page Cluster Ratio (IPCR), indicates the xe2x80x9cpackingxe2x80x9d of index leaf pages by measuring the xe2x80x9cclusteringxe2x80x9d property of index pages, i.e., how well clustered the consecutive index leaf pages of a given relational table""s index are in the extents. The ratio is measured for index leaf page accesses via the scan of a non-clustered/clustered index. Additional formulae have been derived in the optimizer""s cost model that use these statistical measures to estimate the number of I/Os.
More particularly, the Data Page Cluster Ratio (DPCR) for an index (non-clustered or clustered) indicates the ordering and density of data pages in extents with respect to the data access via the given index i.e., how well the data pages are clustered in extents with respect to the order of the index rows. In other words, the DPCR is a measure of data page accesses that do not require more large block I/Os than necessary (i.e., the lower the DPCR, the greater the requirement for additional large block I/Os and vice-versa). In an exemplary embodiment, the database system maintains data page cluster ratios for every non-clustered and placement index on a table. It also maintains this ratio for the data page chains, if any. This ratio is gathered on a per-index basis (including one for the data pages of a table) whenever an index is created or whenever the statistics for a table/index is updated.
The Index Page Cluster Ratio (IPCR) for an index (non-clustered and clustered index) indicates the ordering and density of index leaf pages in extents with respect to the index page accesses via the given index. In other words, the IPCR is a measure of index leaf page accesses that do not require extra physical extent I/Os. In an exemplary embodiment, the database system maintains an index page cluster ratio for every clustered/non-clustered index. The index page cluster ratio is gathered on a per-index basis whenever an index is created or the statistics for a table/index is updated.
The cluster ratios are not maintained in the system statistics catalog, SYSTABSTATS. Instead, only the raw numbers for the ratios are maintained in the SYSTABSTATS catalog. With every insert/delete/update operation, the raw numbers are dynamically updated. The cluster ratios are derived on-the-fly from the current raw numbers every time the optimizer calls the statistics module via an st_gettabstatsinfo( ) method call. The derived cluster ratio is then used in the query optimizer""s cost estimates to decide if a large block I/O for a given access path is feasible and cheap. The optimizer evaluates this fact in two different phases of the optimization process. The first time this is evaluated during index selection, i.e., when the optimizer decides on the cheapest index to select for a given query. Here, after applying an initial cost estimate for the index path, the DPCR/IPCR ratios are used to factor in the effect of large block reads. Later, the optimizer evaluates the cost of doing a disk read (physical I/O) during join enumeration (applicable for the various join enumerations). Here also DPCR/IPCR are factored into the physical I/O estimations. By employing these measures, the database server""s optimizer can provide an intelligent way of considering and selecting useful large block I/O access strategies, thereby leading to increased performance gains during query execution.