Databases typically maintain a transaction log (also known as a redo log or write-ahead log), a history of all updates executed by the database management system, to guarantee the durability property of transactions across crashes and hardware failures. The transaction log is a fundamental component of the database. Increasingly, this same transaction log is being employed in data processing tasks beyond database recovery. A common approach to tracking changes in database records is to “tail” the log, that is, to read the log records in the order of their Log Sequence Number (LSN). Every record in the database transaction log uniquely identifies a transaction and is assigned a monotonically increasing LSN. Log tailing is often used for database replica maintenance (a process called log shipping), for observing all changes happening in the database and acting upon those changes, and for database backup.
Log tailing has several properties that make it suitable for these tasks. A client tailing the log only has to maintain an offset into the log, and it is guaranteed exactly-once delivery of all changes happening in the database or data store. Each client can download the log at a rate that suits it, as opposed to the server pushing the log to the client. The same log can serve a virtually unlimited number of clients. If the database server or the log client fails, the tailing process can be restarted at the previously recorded offset without missing or duplicating any update.
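As a rough illustration of the pull-based tailing described above, the following Python sketch models a log as an in-memory list and a client that remembers only its offset. The names `TransactionLog`, `LogTailer`, `read_after`, and `poll` are assumptions made for this example and do not correspond to any particular database; a real log would be durable and the offset would be persisted by the client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    lsn: int    # monotonically increasing Log Sequence Number
    key: str    # identifier of the database record that changed
    value: str  # value written by the update


class TransactionLog:
    """Hypothetical append-only, in-memory stand-in for a transaction log."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        # LSNs start at 1 and grow by one per appended record.
        lsn = len(self._records) + 1
        self._records.append(LogRecord(lsn, key, value))
        return lsn

    def read_after(self, offset, limit=100):
        # Records with LSN > offset live at list indices offset, offset+1, ...
        return self._records[offset:offset + limit]


class LogTailer:
    """Hypothetical client that pulls records at its own pace."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset  # LSN of the last record this client processed

    def poll(self, limit=100):
        batch = self.log.read_after(self.offset, limit)
        if batch:
            # Advance the offset only after the batch has been handed over.
            self.offset = batch[-1].lsn
        return batch
```

A client that restarts with its saved offset, for example `LogTailer(log, offset=saved)`, resumes at the next unread record, which is the property that lets tailing avoid missed or duplicated updates in this simplified model.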
However, log tailing becomes less attractive as write throughput on the database increases. As the log volume grows, the amount of work that the log tailers must do grows proportionately. In such scenarios, the load on the log observers can be reduced if multiple updates to the same database record can be collapsed into a single update for the observer.
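To make the idea of collapsing concrete, the following Python sketch reduces a batch of log records to the latest update per database record. The function name `collapse_updates` and the `(lsn, key, value)` tuple layout are assumptions chosen for illustration only.

```python
def collapse_updates(records):
    """Keep only the most recent update for each key.

    `records` is an iterable of (lsn, key, value) tuples in LSN order
    (a hypothetical layout). The result is ordered by LSN.
    """
    latest = {}
    for lsn, key, value in records:
        # A later update to the same key overwrites the earlier one.
        latest[key] = (lsn, key, value)
    return sorted(latest.values())
```

For a record that is updated many times within a batch, the observer then handles one entry instead of one per write, so its work scales with the number of distinct records changed rather than with the raw log volume.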
The desire to collapse high-volume changes at the database record level has led to the creation of other ad hoc mechanisms. These typically aggregate the updates for a short duration and periodically flush them to another storage system, from which the client consumes the reduced volume of changes. This mechanism is unsuitable for several reasons. It is difficult to provide reliability and exactly-once semantics unless performance-killing distributed transactions are used. The load decreases, but by the same fixed amount for all change observers, whereas slower observers could have benefited from even longer aggregation. Finally, the presence of another storage system adds another layer of complexity.
The present invention discloses a method that possesses all the good properties of the log tailing method and at the same time provides database-record-level aggregation of changes for each client. In one possible implementation of this scheme, the client receives exactly the set of database records that changed since its last call. Multiple updates to a database record in the time window since the client's last call result in a single update for the client. This method of client interaction with the database or data store is called Tail Aggregated, or TAGG.
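The client-visible behavior described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the disclosed implementation: the class `TaggStore`, its methods `put` and `changes_since`, and the choice to stamp each record with the sequence number of its latest update are all hypothetical, and the linear scan stands in for whatever index a real system would use.

```python
class TaggStore:
    """Hypothetical store that remembers, per record, when it last changed."""

    def __init__(self):
        self._next_lsn = 1
        self._data = {}  # key -> (current value, LSN of the latest update)

    def put(self, key, value):
        # Overwriting a key replaces its LSN, so older updates disappear
        # from the view of any client that has not yet read them.
        self._data[key] = (value, self._next_lsn)
        self._next_lsn += 1

    def changes_since(self, cursor):
        """Return records changed after `cursor` and the client's new cursor."""
        changed = {
            key: value
            for key, (value, lsn) in self._data.items()
            if lsn > cursor
        }
        new_cursor = self._next_lsn - 1  # LSN of the most recent update so far
        return changed, new_cursor
```

In this sketch each client keeps only its own cursor, much like a log offset, and a slow client that calls `changes_since` infrequently automatically sees more updates folded into each returned record.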