The potential advantages of distributed database systems are well known. As computing power becomes more widely available at lower prices, the most cost-effective approach to database implementation often involves harnessing many connected processors together into one large system. Some database uses, such as email or document management, are inherently driven toward distributed implementations. Distributing a database may also improve reliability, since the failure of a single processor in a distributed system will not necessarily halt all use of the database. As a result, databases are often distributed among connectable nodes in a network, with each node receiving a replica of part or all of the database.
However, distributing database replicas creates the problem of maintaining consistency, at least to some degree, between the replicas. Steps must be taken to synchronize the replicas so that a database query using one replica of the database tends (or in some cases, is guaranteed) to give the same result as a query using another replica of the database. Aspects of database transaction synchronization are discussed in commonly owned copending application Ser. No. 08/700,487 filed Sep. 3, 1996. Aspects of clash handling during synchronization are discussed in commonly owned copending application Ser. No. 08/700,489 filed Sep. 3, 1996. Commonly owned copending application Ser. No. 08/700,490 filed Sep. 3, 1996 discusses compression of "physical" update logs, namely, logs which are created and maintained more-or-less continuously during database usage. These discussions are incorporated herein by reference.
Caching part or all of a replica in memory, to reduce disk accesses and/or network traffic, may dramatically reduce the response time to a query. However, caching complicates synchronization by increasing both the number and kind of replicas present in the system. Both cached replicas and replicas stored on disk must be updated to maintain adequate consistency throughout the database. In addition, decisions must be made about when to use the cache and when to use the disk in response to a database query or update operation.
One synchronization method sends a list of cached database object identifiers and corresponding timestamps or sequence numbers from the caching node to a master node which holds a master replica. The master node compares this list with the list of objects in the master replica, compares the timestamps of objects found in both replicas, and then uses a physical update log to generate a list of update operations. The list of update operations is sent back to the caching node and applied to the cached database objects, thereby synchronizing the cached replica with the master replica.
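The exchange described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the names `LogEntry`, `build_updates`, and `apply_updates` are hypothetical, timestamps are modeled as integers, the physical update log is assumed to be ordered by timestamp, and an object missing from the master replica is assumed to have been deleted there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One record in the master's physical update log (hypothetical layout)."""
    object_id: str
    timestamp: int
    op: str               # "create", "modify", or "delete"
    value: object = None


def build_updates(cached_stamps, master_stamps, update_log):
    """Master side: compare the cache's (id, timestamp) list with the master
    replica and collect the log entries the cache has not yet seen."""
    updates = []
    for object_id, cached_ts in cached_stamps.items():
        master_ts = master_stamps.get(object_id)
        if master_ts is None:
            # Assumption: absent from the master means deleted there.
            updates.append(LogEntry(object_id, cached_ts, "delete"))
        elif master_ts > cached_ts:
            # Replay every logged operation newer than the cached copy.
            updates.extend(
                e for e in update_log
                if e.object_id == object_id and e.timestamp > cached_ts
            )
        # Equal timestamps: already synchronized, nothing to send back.
    return updates


def apply_updates(cache, cached_stamps, updates):
    """Caching side: apply the returned operations in order."""
    for e in updates:
        if e.op == "delete":
            cache.pop(e.object_id, None)
            cached_stamps.pop(e.object_id, None)
        else:
            cache[e.object_id] = e.value
            cached_stamps[e.object_id] = e.timestamp


# Usage: the cache holds three objects, all stamped 10.
cache = {"a": 1, "b": 2, "c": 3}
cached_stamps = {"a": 10, "b": 10, "c": 10}
master_stamps = {"a": 10, "b": 12}          # "c" was deleted on the master
update_log = [
    LogEntry("a", 10, "create", 1),
    LogEntry("b", 10, "create", 2),
    LogEntry("c", 10, "create", 3),
    LogEntry("b", 11, "modify", 20),
    LogEntry("b", 12, "modify", 21),
    LogEntry("c", 13, "delete"),
]
updates = build_updates(cached_stamps, master_stamps, update_log)
apply_updates(cache, cached_stamps, updates)
```

In this example the identifier and timestamp for object `a` are sent to the master even though `a` is unchanged, which is the kind of transmission the method cannot avoid.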
A major drawback of this synchronization method is that some object identifiers and timestamps may be transmitted even when the cached and master copies of the objects in question are already synchronized. This wastes bandwidth, memory, and processing cycles, particularly as the number of cached objects grows. Another drawback is that flexible caching policies are hard to implement because all updates are treated the same way by all caches.
Using a physical update log to track operations on the master replica also has disadvantages. Logs can be quite large if they are not compressed, since each log must contain at least one entry (recording object creation) for each object in the replica. Even if log compression is used, physical update logs may require substantial disk storage space on the node. A synchronization checkpoint must also be maintained in the log for each other replica that can synchronize with the master replica. These checkpoints prevent a later update from being merged into an earlier one when the checkpoint falls between the two updates. Because one checkpoint is needed per replica, they also limit how well the approach scales as the number of caching replicas grows.
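The effect of checkpoints on log compression can be illustrated with a small sketch. It is only an assumed model, not the compression scheme of the referenced application: log entries are hypothetical `(sequence, object_id, value)` tuples, merging is modeled as a later update to an object superseding the earlier one, and a checkpoint at sequence number `c` is taken to mean that some replica has synchronized through `c`.

```python
def compress_log(entries, checkpoints):
    """Merge successive updates to the same object, except where a
    checkpoint falls between the two updates.

    entries     -- list of (seq, object_id, value), sorted by seq
    checkpoints -- set of sequence numbers, one per synchronizing replica
    """
    compressed = []
    last_index = {}   # object_id -> index of its latest entry in compressed
    for seq, object_id, value in entries:
        prev = last_index.get(object_id)
        if prev is not None:
            prev_seq = compressed[prev][0]
            # A checkpoint between the two updates blocks the merge.
            blocked = any(prev_seq <= c < seq for c in checkpoints)
            if not blocked:
                compressed[prev] = None   # earlier update is superseded
        last_index[object_id] = len(compressed)
        compressed.append((seq, object_id, value))
    return [e for e in compressed if e is not None]


log = [(1, "x", "v1"), (2, "y", "w1"), (3, "x", "v2"), (4, "x", "v3")]

no_checkpoints = compress_log(log, set())       # 2 entries remain
one_checkpoint = compress_log(log, {2})         # 3 entries remain
two_checkpoints = compress_log(log, {2, 3})     # all 4 entries remain
```

Under these assumptions, each additional checkpoint can block further merges, so the compressed log tends to grow with the number of replicas that synchronize against it.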
It would therefore be an advancement in the art to provide an improved method and system for distributed database caching that reduce the amount of unnecessary data sent between nodes.
It would be an additional advancement to provide such a method and system which support caching policies that treat specified updates differently from other updates.
It would also be an advancement to provide such a method and system which create update logs on demand rather than continuously.
Such a method and system for distributed databases are disclosed and claimed herein.