The present disclosure is directed towards database caching over distributed networks. The proliferation of distributed web applications is increasing the frequency of application queries to remote database servers. To improve the performance of such queries and enhance data availability, such applications can use a local database cache. For instance, an edge server in a content distribution network can use a nearby database cache to speed up data access and generate dynamic web content more quickly at the edge of the network.
Typical techniques to cache data on edge servers rely on either (i) explicit replication of the entire database or an explicitly specified part of it on a local machine, or (ii) the caching of previous query responses and the exact-matching of new query statements against previously cached responses. In the replication approach, the contents of the cache are explicitly specified by an administrator who must determine which parts of the database tables are to be replicated on the edge node. Once the cache contents are specified, either as a table name or as a “materialized view” definition, the data is copied from the origin server to the edge cache.
In the query response caching approach, the cache is dynamically populated with the responses of application queries. The data in the cache is described by a list of query responses, with each response tagged with the query statement from which it was generated. A response is used to answer a subsequent query only if that query matches, typically through a string comparison, the query string corresponding to the cached response. Query response caches eliminate the need for administrator control by dynamically caching data, but store data inefficiently in separate regions, one region per query response. This induces a high space overhead because query responses often overlap, typically including a significant and common part of the underlying database, so the same base data may be replicated in many query responses. Furthermore, this approach leads to limited performance benefits because a cached query response is used only to satisfy an exact resubmission of the same query, but not other queries whose results may be contained in the response. For example, consider an initial query to find the social security numbers of all employees who are less than 30 years old. The SQL statement would be:
SELECT employee.ssn FROM employee WHERE employee.age<30
Assuming that the response for this query is cached, and that the cache receives a new query to find the social security number of all employees who are less than 25 years old, that SQL statement would be:
SELECT employee.ssn FROM employee WHERE employee.age<25
Although the response to the new query is included in the response to the first, since all employees who are less than 25 years old are also less than 30 years old, a query response cache based on exact-matching would not be able to service the new query from the cache.
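The exact-matching limitation and the space overhead described above can be illustrated with a minimal sketch. The class name `QueryResponseCache`, the in-memory SQLite database standing in for the origin server, and the sample rows are all assumptions made for illustration; they do not describe any particular caching product.

```python
import sqlite3


class QueryResponseCache:
    """Hypothetical exact-match query response cache (illustrative only)."""

    def __init__(self, origin):
        self.origin = origin      # connection to the origin database
        self.responses = {}       # query string -> cached list of rows
        self.hits = 0
        self.misses = 0

    def execute(self, sql):
        # A cached response is reused only if the query string is identical.
        if sql in self.responses:
            self.hits += 1
            return self.responses[sql]
        self.misses += 1
        rows = self.origin.execute(sql).fetchall()
        self.responses[sql] = rows   # each response is stored in its own region
        return rows


# Assumed sample data standing in for the origin server.
origin = sqlite3.connect(":memory:")
origin.execute("CREATE TABLE employee (ssn TEXT, age INTEGER)")
origin.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("111-11-1111", 22), ("222-22-2222", 27), ("333-33-3333", 41)],
)

cache = QueryResponseCache(origin)
under_30 = cache.execute("SELECT employee.ssn FROM employee WHERE employee.age<30")
under_25 = cache.execute("SELECT employee.ssn FROM employee WHERE employee.age<25")

# The second query is contained in the first, yet it still goes to the origin.
print(cache.hits, cache.misses)

# Rows held in both cached responses, i.e. base data stored more than once.
duplicated = set(under_30) & set(under_25)
print(len(duplicated))
```

In this sketch both queries are misses, and the row for the 22-year-old employee is stored in both cached responses, which is the overlap that inflates the space overhead.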
In summary, explicit administrator-defined data caching requires manual involvement and presumes that the administrator has intimate knowledge of the workload and the resources on each edge server. Query response caches eliminate administrator overheads, but suffer from limited effectiveness and high space overhead. Furthermore, consistency management becomes complex because of the mismatch between the representation of the cached data and the base data in the origin server. Consistency control generally requires either invalidating all query responses when any base tables change, or maintaining complex dependency graphs.
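The coarse-grained consistency control mentioned above, in which all query responses are invalidated whenever a base table changes, might look like the following sketch. The class `InvalidatingResponseCache`, the `fake_origin` function, and the `dept` table are hypothetical names introduced only for this illustration.

```python
class InvalidatingResponseCache:
    """Hypothetical response cache with coarse-grained invalidation."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable that runs a query at the origin
        self.responses = {}       # query string -> cached response

    def execute(self, sql):
        if sql not in self.responses:
            self.responses[sql] = self.fetch(sql)
        return self.responses[sql]

    def on_base_table_change(self, table):
        # Without a dependency graph, the cache cannot tell which responses
        # the change affects, so it discards all of them.
        dropped = len(self.responses)
        self.responses.clear()
        return dropped


origin_calls = []


def fake_origin(sql):
    # Stand-in for the origin server; records each query it receives.
    origin_calls.append(sql)
    return [("row for " + sql,)]


cache = InvalidatingResponseCache(fake_origin)
cache.execute("SELECT employee.ssn FROM employee WHERE employee.age<30")
cache.execute("SELECT dept.name FROM dept")

# A change to employee also evicts the unrelated dept response.
dropped = cache.on_base_table_change("employee")
cache.execute("SELECT dept.name FROM dept")   # must be fetched from the origin again
print(dropped, len(origin_calls))
```

Here a single change to the employee table discards both cached responses, so the unrelated dept query has to be sent to the origin a second time. Avoiding that cost is what would require the more complex dependency tracking.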
It is therefore desirable to have a cache that does not require an administrator to specify its contents, or to adapt that specification according to changes in the workload or to changes in the availability of resources on the machine where the cache resides. It is further desirable for the cache to be efficient in storage overhead and in consistency maintenance.