1. Field of the Invention
This invention relates in general to computer-implemented database systems, and, in particular, to query optimization using a multi-layered object cache.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. Relational databases are organized into tables which consist of rows and columns of data. The rows are formally called tuples or records. A database will typically have many tables and each table will typically have multiple tuples and multiple columns. The tables are typically stored on direct access storage devices (DASD), such as magnetic or optical disk drives for semi-permanent storage.
The integration of object technology and database systems has been an active area of research for the past decade. One important aspect of the integration of these two technologies is the provision of efficient, declarative query interfaces for accessing and manipulating object data. Compared to other aspects of object-oriented database (xe2x80x9cOODBxe2x80x9d) technology, such as integrating persistence into object-oriented languages (e.g., C++ and Smalltalk), queries were given relatively little attention in the early days of OODB research.
A number of proposals for OODB query languages have appeared in the database literature. Query rewrite transformations have been developed for relational DBMSs. Many of these transformations also apply for Object Query Systems. However, new query rewrite transformations that apply specifically to Object Query Systems need to be developed. Predicate pushdown, which is a query rewrite transformation, is the notion of taking a query and determining which parts of the query can be migrated through the layers of the schema to the databases where the data resides. The objective is to use the power of the database query function to do data filtering, and, thereby, restrict the amounts of data that have to be transferred from the database servers to clients.
Predicate pushdown can include all of the predicates that define a query""s result, in which case the task of restricting the result set is entirely performed by the databases where the data resides. Predicate pushdown can include partial predicates that define a query""s results, in which case some of the predicates (e.g., a subset of the conjuncts that define a query""s result) are passed down to the databases where the data resides, thereby restricting the results returned by these databases. The remaining predicates that could not be pushed down are then applied in object space by the query evaluator. Finally, if redicate pushdown cannot be applied, the predicates that define a query""s results must be applied in object space after having retrieved the complete sets of data referenced in the query.
Query evaluation using a client cache is presented in Shaul Dar, Michael J. Franklin, Bjxc3x6rn T. Jxc3x3nsson, Divesh Srivastava, and Michael Tan, Semantic Data Caching and Replacement, Proc. 22nd International Conference on Very Large Data Bases, Mumbai, August 1996, [hereinafter xe2x80x9cDar et al.xe2x80x9d]. Dar et al. focuses on determining whether a query can be resolved from the client cache alone or whether a partial query result can be obtained from the client cache with the remaining result drawn from the database server. The technique that is used for query evaluation is predicated on maintaining a semantic description of the client cache content. For a given table, the semantic description is a constraint that is dynamically modified to include new cache entries. For example, if a query initially retrieves all employees having a salary between 50,000 and 100,000, the constraint describing the cache content for the employee table is salxe2x89xa750000 and salxe2x89xa6100000. If a subsequent query requests employees having a salary between 60,000 and 80,000, that query result can be drawn from the cache alone. A similar approach called predicate-based caching is presented in Arthur M. Keller and Julie Basu, A Predicate-Based Caching Scheme for Client-Server Database Architectures, The VLDB Journal, 5:35-47, 1996.
The ORION system is described in the paper entitled Architecture of the ORION Next-Generation Database System, by W. Kim, J. Garza, N. Ballou and D. Woelk, published in IEEE Transactions on Knowledge and Data Engineering, 2(1), in March 1990. In ORION, deferred updates result from altering the schema of a class, as, for example, when an attribute of a class is dropped from its schema. The task of updating all records of the class to remove the attribute from each instance is deferred until some later time. This allows schema updates to proceed quickly. ORION is an OODBMS which implements a client-server organization with support for queries that return handles on objects of a single class as query results. However, due to this restriction, ORION queries are limited to semi-joins, which are expressed as path queries (referred to as xe2x80x9cnested class queriesxe2x80x9d). As an optimization, ORION queries can run both on the server and on the client computer. In that system updates to client resident objects which have not been propagated to the server can be reflected in query results using either a single buffer or a dual buffer evaluation scheme. Using the single buffer scheme, updates in the client cache are flushed to the server and the query is evaluated against the server.
However, this scheme is incompatible with optimistic locking, since updates can only be flushed at commit time. The dual buffer evaluation scheme runs a query against xe2x80x9cdirtyxe2x80x9d objects in the client cache and against objects in the server. The two result sets are merged; result objects in the server""s result set, which are also present in the client""s result set, do not participate in the final result. This scheme is incompatible with optimistic locking. In this scheme, for example, there could have been an object O read into the cache by an earlier query within a single ongoing transaction. The object has since been updated in the database by a separate unrelated transaction. While the cache copy is xe2x80x9ccleanxe2x80x9d and qualifies the query""s result, the database copy no longer qualifies the query""s result and is not returned by the server query. Since O is not xe2x80x9cdirtyxe2x80x9d, it is not selected by the client compensating query.
In the ORION system, the dual buffer evaluation scheme is also presented for path queries. The scheme involves pushing partial updates back to the server, which is incompatible with optimistic locking. Furthermore, this technique has problems with respect to duplicate semantics, if extended to full join queries. While there are buffering differences, objects are homogeneous in both the client and the server. Moreover, this system is not concerned with creating an object-oriented representation of data from heterogeneous sources, and does not support extended SQL-92 queries which can return values, as well as objects, and, aside from paths, include operations which are joins, grouping and aggregation, and union.
The Garlic system is described in the paper entitled Loading a Cache With Query Results, by Laura M. Haas, Donald Kossman and Ioana Ursu, published in Proc. 25th International Conference on Very Large Data Bases, pages 351-352, Edinburgh, in September 1999. The Garlic database middleware system supports object queries that return handles to objects as query results and allows the application of methods in queries over collections of objects. The Garlic system can return handles on objects as query results without retrieving the data associated with these objects. Retrieval is performed by a separate key-based query, issued by the application when attributes of the object are accessed. Therefore, the Garlic system allows building objects strictly by retrieving data for the key attributes and, alternatively, retrieving data for all of the object""s attributes.
Moreover, Garlic has a double caching architecture to manage objects created during query evaluation and to manage objects which are returned to the application as query results. The Garlic optimizer evaluates the cost of moving cache operators to the top of a query plan or to the bottom of a query plan. Moving the cache operators to the top of the query plan minimizes the number of irrelevant cache hits (i.e., objects added to the cache which do not participate in the query""s result), but incurs an additional cost of migrating all the data needed by the caching operations through the plan operators. Moving the cache operators to the bottom of the plan has inverse properties. However, the Garlic system does not yet address the problem of evaluating queries with deferred updates in the cache.
The Persistence system is described in the paper entitled Persistence Software: Bridging Object-Oriented Programming and Relational Databases, by A. Keller, R. Jensen and S. Agrawal, published in Proc. ACM-SIGMOD International Conference on Management of Data, Washington D.C., in May 1993. Persistence is an object-oriented database middleware system that supports the automated generation of C++ schemas and methods from database schemas. The C++ classes implement xe2x80x9cgetxe2x80x9d methods, to retrieve an object into the cache based upon its key. However, it is not possible to execute queries against the object schema directly, although it is possible to execute xe2x80x9ccache loadingxe2x80x9d queries in SQL. These SQL queries are based upon the database schema to bulk load a set of objects into the cache, instead of faulting objects into the cache from the databases, one-at-a-time, when relationships are traversed or when application objects are requested individually.
Object middleware servers use an object cache to present an object-oriented view of heterogeneous data derived from backend databases. The object cache implements object identity such that an object retrieved from the database is represented only once in the cache. Cached objects can be updated while the propagation of these updates to the backend databases can be deferred, say until the end of the transaction or at some other point in time. Queries expressed against the object model must reflect in-cache updates. The cost in performance of running these queries can be high, since the relevant data for the query results must be extracted from the backend databases into the cache and the query then executed against the cache.
Therefore, there is a need in the art to extend the previous systems in a middleware system with a less-expensive multi-layered object cache connected to a database. Furthermore, it is preferable to allow rewriting object-oriented path expressions into join clauses which can be pushed down to the underlying database, thus further optimizing query performance
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for a computer-implemented technique for query optimization using a multi-layered object cache.
In accordance with the present invention, a method, apparatus, and article of manufacture for a computer-implemented technique for query optimization using a multi-layered object cache is presented. An object-oriented query is executed to retrieve data from a database. The database is stored on a data storage device connected to a computer and queries are run against lower cache layers which have better performance characteristics than the external or upper layers.
The multi-layered cache has an application objects (AOs) cache and a data access objects (DAOs) cache, and the application objects include methods deriving data from DAOs. The method includes a step of rewriting the query into a query directed against the DAOs cache, applying a pushdown transformation algorithm to the rewritten query directed against the DAOs cache, and executing the transformed query against the DAOs cache.
The method also allows optimizing queries with path expressions by transforming path expression into joins. Another performance optimization includes using a limited data set, only including the inserted and updated data, which allows the method to evaluate queries against a subset of a collection of cached objects instead of an entire collection of cached objects.