1. Field of the Invention
The present invention relates to data processing apparatus and corresponding methods for the retrieval of data stored in a database or as computer files. More particularly, the invention relates to database or file accessing.
2. Description of the Prior Art
Individuals and organizations are increasingly storing data electronically in databases, a collection of the data arranged for ease of storage, retrieval, updating, searching and sorting by computerized means. As the size, number, and complexity of such databases is growing, sophisticated Data Base Management Systems (DBMS) are continually being developed and improved to facilitate database use and management. In a modern DBMS the data may be stored at multiple, non-contiguous locations within one storage volume, or spanned across multiple volumes. Such a DBMS may also be used for multiple purposes, often by multiple users, effectively concurrently.
FIG. 1 (background art) is a block diagram conceptually depicting the basic elements and operation of a DBMS 10 that are most relevant to the problem this invention addresses. For present purposes it is a given that data has been pre-stored in a database 12. A database engine 14 communicates with the database 12 to retrieve and work with the data, and in the course of its work the database engine 14 employs a buffer cache 16. In operation, a query 18 is provided to the database engine 14. The database engine 14 then directs retrieval of the data the query 18 will work with from the database 12 and its storage, all or in part, in the buffer cache 16. The database engine 14 then operates on the data in the buffer cache 16, directing additional data retrieval and storage in the buffer cache 16 if necessary, until processing of the query 18 is complete and a result 20 is arrived at.
FIG. 2 (background art) is a block diagram conceptually depicting the contents of the database 12 of FIG. 1. Specifically, these contents are tables 22, 24, 26. In actual practice, a database may contain a lesser or greater number of tables than is shown in FIG. 2, and databases typically contain many more.
FIG. 3 (background art) is a block diagram conceptually depicting the contents of a table, say, table 26 of FIG. 2. Here it can be seen that table 26 contains a plurality of elements of data 28 stored in a plurality of rows 30. The data 28 in the rows 30 is often referenced by columns 32, and FIG. 3 shows four columns labeled C1, C2, C3, C4. In actual practice, a table may contain a lesser or greater number of rows and columns than are shown in FIG. 3, and it is not unusual for tables to contain tens of columns and thousands of rows. FIG. 3 shows an optimal situation where the table 26 is stored contiguously. In actual practice, constraints often exist that do not permit this.
FIG. 4 (background art) is a block diagram conceptually depicting the contents of table 26 again, now grouped into pages 34 in an extent 36. Data processing systems, including DBMSs and their sub-systems, have limits on the quantity of data they can handle at once. There always is an ultimate limit, and this and other considerations usually motivates selection of a lesser quantity of memory termed a “page” that the system works with. Particularly with respect to moving data into and out of memory, such as the buffer cache 16 in our example here, data is often spoken of as being paged into and paged out of memory. The pages 34 in FIG. 4 are simplified examples containing only three rows of data each. An extent 36 is a set of physically contiguous pages 34 of a table, and the table then consists of a set of such extents 36.
FIG. 5 (background art) is a block diagram conceptually depicting a common situation in a large database, say, database 12 of FIG. 1-2. The totality of the database 12 in our example here thus resides in three disk drives 38, 40, 42. The tables 22, 24, 26 are still stored in the database 12, but it now can be seen that they are stored as extents 36 in the various disk drives 38, 40, 42. In FIG. 5 the tables and extents are represented by the notation: Table_id.Extent_id. For instance, T1.E4=the 4th extent of table T1. Thus, a simple query like “select * from T2” on execution would turn into something like: Scan T2.E1, . . . , T2.E7.
As the DBMS has grown in importance and use, it has become increasingly desirable to optimize performance. Many approaches have been tried, most of which are not germane here. Of present interest, however, is optimizing the processing of queries, i.e., the scanning of particular sub-sets of the stored data. For this, the data that is needed by a query, i.e., its extents, is copied into the buffer cache. Unfortunately, many factors can combine to complicate performing efficient data scans across query extents in this overall arrangement.
Even after optimization in other respects, some queries may need to access hundreds of extents, performing full data scans across those. If there are multiple queries being executed, it is quite possible that those queries will access much of the same data. However, as matters exist now, the ability to share that data is largely arbitrary. Depending on the order of the list of extents for data brought into the buffer cache for a given query and when other queries start, the ability to share the cached data for multiple scans is thus based on luck.
Some database engines today separate each extent into a separate thread, such that each thread is responsible for scanning only one extent and returns the results to a join/merge operator which aggregates the results. But in a system with limited resources where starting that many threads would consume the available resources, a thread may be given a whole list of extents to operate on. For our purposes here, we will assume that a single thread is given the list of extents to scan, and rather than speak in terms of “threads” we will herein use the word “scan” to better represent a list of extents to be operated on.
Let us consider a simple hypothetical case using a query Q1 that uses an extent E1, a Q2 that uses an E2, and a Q3 that uses an E3, wherein E3 is a subset of E1. If E2 is large enough that it “pushes” relevant parts of E1 out of the buffer cache, it follows that Q1, Q2, Q3 is not an efficient ordering of query execution. Generalizing, if two or more scans are active and share a large set of the extents required to satisfy their respective queries and have no coordination between them, buffer cache thrashing can take place where pages of one scan replace pages of the other scan within the buffer cache. This causes increased I/O as well as slowing overall query processing.
Alternately, let us consider a case with queries Q1 tthrough Q25, respectively using extents E1 through E25, wherein E1 and E25 are the same. Running Q1 and Q25 separately is not efficient. Attempting to get “multi-streaming” by utilizing very large buffer pools is an option, but not a viable one in most cases. Buffer cache has inherent “costs” that weigh against unduly increasing its amount, e.g., components cost, power usage, physical footprint, etc. It is also usually desirable for a DBMS to free up available memory resources for other operations, like sorts and hash joins.
The attempts to date to improve DBMS performance have largely involved optimizing queries for execution. For example, Pat. App. US 2002/198872-A1 by MacNicol et al. discloses a “Database System Providing Optimization Of A Group By An Operator Over A Union All.” It describes a database system providing methods for optimization and execution of queries aggregating information from multiple database tables. The method here breaks serial operations into a group of child operations that can be individually optimized and processed in parallel. A query including a grouping operator over a union node is split by pushing the grouping operation down into each input arm of the union node. Specifically, grouping operators are inserted into each child arm of qualifying union nodes, enabling separate physical database tables to be summarized in parallel. Each of the child grouping operators concurrently groups data and generates summary results in parallel, taking advantage of capabilities of available parallel processing hardware. The aggregate operators of the original master grouping operator are modified to accept and correctly synthesize sub-aggregated results generated by each of the child grouping operators. The method here also enables individual optimization of operations on each database table, including use of index-only execution plans.
As such, this reference provides background on the problem and teaches one approach that has been tried, but that approach does not include a runtime component to query execution that monitors extent scans across all queries being executed and allows the re-ordering of the extents list for each query so that queries accessing same extents will have a much higher chance of finding the data they need in the buffer cache.
U.S. Pat. Nos. 5,813,005 and 6,510,428 by Tsuchida et al. both titled “Method And System Of Database Divisional Management For Parallel Database System” and U.S. Pat. No. 5,761,696 by Giordano, et al. discloses a “Parallel Database Serving Mechanism For A Single-Level-Store Computer System” teach parallel approaches. The former references teach parallel processing on multiple platforms with just one I/O server, thus teaching a approach for use in some limited hardware configurations. However, this does not include a runtime component to ensure a much higher chance of finding needed data in buffer cache. The latter reference teaches cross-address space mapping of query results, thus addressing matters after extents are loaded into cache.
It is, therefore, an object of the present invention to provide an improved system for managing buffer cache in a DBMS. Other objects and advantages will become apparent from the following disclosure.