Present invention embodiments relate to database maintenance, and more specifically, to removing garbage data (e.g., data that is no longer referenced or obsolete) from a database that employs separate transaction identifier storage.
In a relational database, tables of data are stored in which data from one table may have some relationship with the data stored in another table. The relationships between the data in the various tables allow the processing of queries (e.g., database searches) in an orderly fashion.
When multiple users or computer processes access the same database concurrently, issues may arise when existing data in the database is changed. For example, while a database record is being created or modified, many database systems that do not provide multi-version concurrency control will “lock” that record for the duration of the update. To avoid “locking” individual records or groups of records, multi-version concurrency control (MVCC) permits multiple copies of the same record, whereby multiple transactional changes to a database record are reconciled at some later point in time. As a consequence, when a database employs MVCC, multiple users or processes may change a record, and multiple copies of the same record (with their corresponding changes) are stored until all changes are reconciled and committed.
In order to reconcile multiple changes to a given record, transaction identifiers (TIDs) are maintained for each copy or version of a record with actual or attempted record changes. A version of a record (i.e., a tuple) is marked for removal, due to either deletion or update of the record, by modifying the TID associated with that version. A tuple marked for deletion is not necessarily a garbage tuple that can be permanently removed, since that tuple may still be visible to other executing transactions and should, therefore, be retained. Existing approaches for performing database garbage collection (i.e., removal of garbage data) include waiting until an entire page or set of data becomes garbage before removing it, or incrementally deleting garbage tuples as they are designated. In many cases, it is the responsibility of the system operator or user to invoke garbage collection at a time that avoids conflicting with real-time production operations. These approaches have drawbacks in that garbage may persist for a period of time and consume processing time in the interim, since that data may still be processed or filtered. In addition, removing garbage by scanning through all data is time consuming and competes with other operations. Furthermore, the database system does not “know” the quantity of garbage tuples present in the data store at any given time, and therefore cannot use that quantity to trigger a garbage collection event (e.g., the system cannot automatically garbage collect or prompt the user to initiate such garbage collection).
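By way of illustration only, the distinction between a tuple that is merely marked for deletion and a tuple that is actually garbage may be sketched as follows. This is a simplified sketch and not a description of any particular database system: the names `TupleVersion`, `create_tid`, `delete_tid`, `is_visible`, `is_garbage`, and `count_garbage` are hypothetical, and the sketch assumes that every TID belongs to a committed transaction and that a transaction reading at snapshot TID `s` sees exactly the changes made by TIDs less than or equal to `s`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TupleVersion:
    """One stored version of a record (hypothetical layout)."""
    key: str
    value: str
    create_tid: int                   # TID of the transaction that created this version
    delete_tid: Optional[int] = None  # TID that deleted or superseded it; None if live


def is_visible(version: TupleVersion, snapshot_tid: int) -> bool:
    """A version is visible if it was created at or before the snapshot
    and had not yet been marked for deletion as of the snapshot."""
    created = version.create_tid <= snapshot_tid
    deleted = version.delete_tid is not None and version.delete_tid <= snapshot_tid
    return created and not deleted


def is_garbage(version: TupleVersion, oldest_active_tid: int) -> bool:
    """A version is garbage only if it is marked for deletion AND the deletion
    is old enough that no active transaction's snapshot can still see it."""
    return version.delete_tid is not None and version.delete_tid <= oldest_active_tid


def count_garbage(versions: Iterable[TupleVersion], oldest_active_tid: int) -> int:
    """Count garbage versions; assumes the oldest active snapshot TID is known."""
    return sum(1 for v in versions if is_garbage(v, oldest_active_tid))


# An update by transaction 20 marks the old version deleted and adds a new one.
old = TupleVersion(key="r1", value="a", create_tid=10, delete_tid=20)
new = TupleVersion(key="r1", value="b", create_tid=20)

# While a transaction with snapshot 15 is still running, the old version
# is marked for deletion but must be retained.
retained_while_reader_active = not is_garbage(old, oldest_active_tid=15)

# Once the oldest active snapshot advances to 25, the old version is garbage.
removable_after_reader_ends = is_garbage(old, oldest_active_tid=25)
```

In this sketch, a tuple marked for deletion by transaction 20 remains visible to a transaction reading at snapshot 15 and becomes removable only after every active snapshot has advanced past the deleting TID. Note that `count_garbage` still scans every version, which is the cost identified above as a drawback of scan-based approaches.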