The present invention relates to the implementation of database-driven applications. One challenge with such applications is to coordinate the logic of applications with the structure and organization of external databases to which the applications refer. Patterns of query and access that make sense for a particular application may be inefficient when translated directly into searches and retrievals made against an external database. An opportunity for addressing this challenge lies in the fact that the database interface can hide the actual implementation of complex requests, reorganizing their implementation to better fit the structure of the database and the costs of access.
The standard approach to this reorganization is to require that queries be expressed in a limited formal language whose logical properties are well understood. Queries in such a language can then be manipulated to create logically equivalent queries that are more efficient to apply against a particular external database. The most widespread of such approaches is to describe queries in terms of a relational algebra, a mathematical formalism with certain core operations and combination methods. A particular query represented in terms of relational algebra can then be rewritten in a way which is provably equivalent to the original query but which can be executed more efficiently against a given database or set of databases.
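As an illustration of such a rewrite, one commonly cited textbook identity (offered here only as an example, not as a description of any particular system) is the pushing of a selection below a join. Here R and S denote relations and p a selection predicate; these symbols are introduced for illustration only.

```latex
\sigma_{p}(R \bowtie S) \;\equiv\; \sigma_{p}(R) \bowtie S
\quad \text{when } p \text{ refers only to attributes of } R
```

The right-hand side filters R before joining, so the join typically processes fewer rows while returning the same result.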
Much of the database management system industry has standardized on an external query format, SQL (Structured Query Language), which maps cleanly into relational algebra, allowing query rewrites to provide efficient access to external databases without forcing application writers to customize their queries or operations any further than necessary to express their data requirements as a series of SQL queries and operations. In addition, this level of abstraction allows database designers to optimize databases for different kinds of access patterns (allowing even more efficient rewrites) without the recoding or recompilation of applications.
In the past decade, new data models have emerged which are object-oriented or object-relational. These systems typically work by either translating the object or hybrid models into the same relational algebra used in conventional databases or by augmenting the relational algebra in particular ways. In general, these approaches use the same core method of rewriting queries to better fit the structure of the external database.
The rewriting approach has a number of deficits.
   1. It requires that the query language be restricted enough to allow rewritten queries to be provably equivalent to the original query.
   2. Effective rewriting tends to require articulating, in some detail, the structure of the database itself; this may be difficult if the database is (for example) a networked external service provided by a third party.
   3. It is difficult for the rewriting process to include aspects of the practical semantics of the application and database, which could produce substantial performance improvements; such practical semantics are most commonly built into the program logic of the application and so are outside the normal scope of query rewriting.
   4. Query rewriting is typically a static process (done when an application is compiled or a query is first executed) and does not reflect information gathered during the actual execution of a query.
These deficits are now described further.
In order to produce provably equivalent rewrites (1), it is necessary to restrict the query language to disallow expressions that cannot be rewritten to yield provably equivalent queries. For example, standard programmatic constructs such as iterations and conditionals do not translate cleanly into relational algebra and so SQL normally does not handle such constructs directly. Instead, most SQL implementations break complex queries into sub-queries connected by the programmatic logic of iterations and conditionals, but the query as a whole generally cannot take advantage of optimizations across the sub-queries. For example, the following pseudo-code fragment illustrates the problem in a very simple form:
    ages = query("SELECT id,ages FROM people WHERE course='CS3091'")
    avg_age = sum(x)/size(x)
    heights = query("SELECT id,height FROM people WHERE course='CS3091'")
    avg_height = sum(y)/size(y)
    grades = query("SELECT id,grade FROM people WHERE course='CS3091'")
    if correlate(ages,grades) > correlate(heights,grades): return 'age';
    else return 'height';
It would probably be most efficient to combine the three database calls in the fragment, but database interfaces would generally be unable to do such a combination because the calls are separated by program logic outside of the normal scope of query optimization.
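The combination described above can be sketched by hand as follows. This is a minimal illustration, not the method of any particular database interface: the table layout (`people` with `age`, `height`, `grade`, and `course` columns), the helper names `pearson`, `better_predictor`, and `make_demo_db`, and the sample rows are all assumptions made for the example, and `pearson` stands in for the fragment's unspecified `correlate`.

```python
import sqlite3
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; stands in for the fragment's 'correlate'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def better_predictor(conn, course):
    """Fetch all three columns in ONE database call instead of three."""
    rows = conn.execute(
        "SELECT age, height, grade FROM people WHERE course = ?", (course,)
    ).fetchall()
    if len(rows) < 2:
        raise ValueError("need at least two rows to correlate")
    ages, heights, grades = zip(*rows)
    return "age" if pearson(ages, grades) > pearson(heights, grades) else "height"


def make_demo_db():
    """Build a small in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (id INTEGER PRIMARY KEY, age REAL, "
        "height REAL, grade REAL, course TEXT)"
    )
    conn.executemany(
        "INSERT INTO people (age, height, grade, course) VALUES (?, ?, ?, ?)",
        [
            (19, 170, 60, "CS3091"),
            (20, 160, 70, "CS3091"),
            (21, 180, 80, "CS3091"),
            (22, 165, 90, "CS3091"),
            (30, 175, 50, "MA1010"),
        ],
    )
    return conn
```

The point of the sketch is that the merge had to be written by the application programmer; a query rewriter confined to individual SQL statements would not see that the three calls share a predicate.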
In order to rewrite queries as efficiently as possible (2), it is important to know the search and storage characteristics of the external database being accessed. For example, a given complex query may express certain independent operations in a particular order, but the order itself may not be logically important. A rewrite may reorder the operations but the most efficient reordering will most likely depend on the implementation of the underlying database and indexing store. An indexing store stores various indices associated with a database. When a third party provides a database, as is increasingly the case with web services (for instance), these characteristics may not be available.
Knowledge of practical semantics can dramatically decrease search times (3) and these practical semantics are generally unavailable to query rewriting. For example, a search for the children of chairs under the age of 32 should be resolved very quickly against almost any database, based on the definition of practical semantics and common sense. However, the use of such practical knowledge in optimizing queries for standard database systems is generally difficult because it can involve complex patterns of conditionals and dependencies that do not map well into a uniform relational algebra. For example, an external database can store metadata about the domains and ranges of particular relationships. Using this metadata, a code fragment such as the following:
    (define (smart-get frame rel)
      (if (test frame 'isa (get rel 'domain))
          (get frame rel)
          (fail)))
    (smart-get (find-objects 'isa chair 'age (lessthan 32)) 'children)

could use the metadata to optimize a query, but it would have to do so by patterns of conditionals, which may not map easily into a relational algebra. For example, given that the recorded domain of ‘children’ is ‘people’, if none of the “young chairs” are people (and it is unlikely that the young chairs are people), a query processing system should be able to resolve this query quickly relative to conventional database management systems using relational algebra to rewrite the query.
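The same domain check can be sketched in Python. Everything here is hypothetical and in-memory: the dictionaries `ISA`, `DOMAIN`, and `SLOTS` stand in for an external database and its metadata, the `lookups` list exists only to show which requests would have reached the database, and returning an empty list stands in for the fragment's `(fail)`.

```python
# Hypothetical in-memory stand-ins for an external frame database.
ISA = {"chair1": {"chair", "furniture"}, "alice": {"person"}}
DOMAIN = {"children": "person"}          # metadata: domain of each relation
SLOTS = {("alice", "children"): ["bob"]}

lookups = []  # records every (frame, rel) lookup that reaches the "database"


def get(frame, rel):
    """Simulated (potentially expensive) database lookup."""
    lookups.append((frame, rel))
    return SLOTS.get((frame, rel), [])


def smart_get(frame, rel):
    """Skip the lookup when the frame falls outside the relation's domain."""
    domain = DOMAIN.get(rel)
    if domain is not None and domain not in ISA.get(frame, set()):
        return []  # analogous to (fail): no database access needed
    return get(frame, rel)
```

Calling `smart_get("chair1", "children")` returns an empty result without touching the simulated database, whereas `smart_get("alice", "children")` performs the lookup.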
Finally, the static rewriting of queries rules out optimizations that are based on the particulars of a query or on information that emerges during the query's execution (4). For instance, a query on a disjunction of values (for instance, the integral years 1990, 1991, 1992, and 1993) that are not known until query run time could sometimes be converted into a range (1990-1993) in the event that the structure and organization of the external database supports ranges, but this is not possible if the query rewriting happens entirely at compile time or initial query time, before the actual values are known.
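A run-time version of this conversion might look like the following sketch. The function name `year_predicate`, the column name `year`, and the `supports_ranges` flag are assumptions introduced for illustration; the values are assumed to be integers that only become available when the query runs.

```python
def year_predicate(years, supports_ranges=True):
    """Build a SQL predicate fragment and its parameters at run time.

    Collapses a disjunction of integer years into a range when the values
    are contiguous and the target database is assumed to support ranges.
    """
    values = sorted(set(years))
    if not values:
        return "0 = 1", []  # empty disjunction matches nothing
    contiguous = values[-1] - values[0] + 1 == len(values)
    if supports_ranges and contiguous and len(values) > 1:
        return "year BETWEEN ? AND ?", [values[0], values[-1]]
    placeholders = ", ".join("?" for _ in values)
    return f"year IN ({placeholders})", values
```

With the years 1990 through 1993 the sketch yields a single range predicate; with non-contiguous values, or when ranges are unsupported, it falls back to an explicit list.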