Data warehouse sizes have been growing in leaps and bounds. An important concern is the storage costs associated with it. This is addressed by the periodic archiving of old data, which might be accessed less often or by its summary removal from the database when not needed. Both methods require the mass delete of data from the warehouse. This is also known as Rollout or as Bulk Delete. The space thus freed up is used to make way for new data that is available. For example, a company might have a warehouse of five years of data. At the end of every month it might delete the oldest month of data and bring in data for the latest month.
In the past, such mass deletes were usually performed in a maintenance window when the system load was low such as after midnight. Recent trends indicate users are moving towards a shorter time frame to perform this type of maintenance activities. Customers want their systems to be available all the time, non-stop, even for a warehouse. Also, the amount of data being rolled out is becoming smaller but it is being done more frequently. These factors make an efficient online rollout mechanism very important for a database engine. The efficiency can be measured by various parameters, like, response time of a rollout, the amount of log space used, the number of locks required, the response time of a rollback of the rollout, how quickly the space freed can be reused and what kind of concurrent access to the table is allowed when the rollout is going on.
The conventional delete mechanism employed by database engines generally works horizontally, on a record at a time. In this a record is deleted and the defined indexes are updated one by one to reflect the delete of that record. An example of this is seen in DB2 UDB V7. For mass or multiple record deletes, one iterates over all records to be deleted in a similar fashion. A delete of a record entails logging the activity in a log file and then doing the physical delete from the table. If one has indexes defined on the table then each index update has to be logged and its structure changed to reflect the delete. The index updates are very costly and even if there are only a few defined, they could overwhelmingly dominate the total cost of the delete. Existing works such as “Efficient Bulk Deletes in Relational Databases” ICDE 2001 and “Performance Study of Rollout for Multi Dimensional Clustered Tables in DB2” in EXPDB 2006 show such increase in cost. For example, the latter reports that the response time for a delete jumping from 15 seconds when there were no rid indexes (record-based indexes) on the table to 826 seconds when one index on partkey column was included. The cost of updating the rid index was approx 811 seconds and it dwarfs all others costs.
If the data in the table is clustered and bucketized on some columns or a function of some columns, such that all records in a bucket or cell are of the same value for that column or its function, one could optimize the delete with conditions on these columns by logging the entire delete of the bucket one time only but logging any indexes individually for each record. Consider a table whose logical view is shown in the left cube 102 of FIG. 1. It has data bucketized and clustered on nation, year and itemid. An example rollout on this table is shown in right 104 of FIG. 1. Here a delete happened with conditions on columns year (value 1992) and itemid (value 1), which results in entire buckets being dropped. What is shown at 104 of FIG. 1 is the resultant logical view of the data after the delete. This method is described in “Performance Study of Rollout for Multi Dimensional Clustered Tables in DB2.” While this tactic provides some relief, improved method may be desirable when there are a large number of rid indexes (record-based indexes) defined on the table that needs to be updated.
One could have buckets in a table as described previously and have indexes defined on those individual buckets rather than the entire table. So if there are 100 buckets there may be 100 individual indexes defined on these buckets rather than one index for the entire table. In such a case when one deletes a bucket, one could mark the entire bucket and its “local” index as deleted as one log event and save a lot of response time. But having these types of indexes (known as local indexes) has their own limitations and is not always ideal in all cases.
For a table partitioned into a huge number of small buckets, an improved method may be desirable. For example, DB2 supports a feature called Multi Dimensional Clustering (MDC), which allows a table to have as much as (2^31) buckets.
Tandem Non-Stop SQL attempts to solve the problem of updating rid indexes by doing the update of multiple indexes in parallel rather than one after the other. A mechanism for bulk deletes is explained in Gartner, A., Kemper, A., Kossman, D., Zeller, B., “Efficient Bulk Deletes in Relational Databases”, Proceedings of the ICDE 2001. The method described, is based on vertical deletes of the base table and any rid indexes defined on it. This is to be contrasted with the conventional method of deleting the table record and updating the rid indexes iteratively for all qualifying records.
U.S. Patent Publication No. 2006/0034277 is a method for reporting reception result of packets in mobile communication system. The publication discloses a bitmap structure, which enables the size of a bitmap field containing reception result information to be significantly reduced while fully performing its acknowledgment function. This patent application describes a bitmap scheme for the communication industry.
U.S. Patent Publication No. 2005/0207353 describes distribution of identifiers in serverless networks. The method may be useful in and applied to serverless telephony systems. It details a description of a hierarchical bitmap and its specific use. U.S. Patent Publication No. 2005/0036494 describes distribution of identifiers in serverless networks, similar to above for serverless telephony systems.
U.S. Pat. No. 6,658,619 describes systems and methods for implementing hierarchical acknowledgement bitmaps in an ARQ protocol. U.S. Pat. No. 6,937,611 describes a mechanism for efficient scheduling of communication flows. A system for servicing communication queues described in that patent may include memory configured to store a hierarchical channel map having a plurality of levels wherein each bit of the lowest level is mapped to a different one of a plurality of communication channels and wherein each bit of each higher level is mapped to a group of bits at the next lower level. The system may include a host adapter configured to maintain the hierarchical bitmap wherein each bit at the lowest level is set if the channel to which it is mapped has a pending communication request and is cleared if not. Each bit of each higher level is set if at least one bit is set in the lower level group to which is mapped and cleared if not. The host adapter may be configured to examine the hierarchical bitmap in order to determine a next one of the communication channels to service. At each level of the hierarchical channel map a service mask may be provided to track which bits have already been traversed at that level. Multiple such service masks may be provided at each level to support different service classes.
U.S. Pat. No. 5,442,758 describes apparatus and method for achieving reduced overhead mutual exclusion and maintaining coherency in a multiprocessor system utilizing execution history and thread monitoring. It describes a mutual-exclusion apparatus for maintaining data coherency while concurrently reading and updating a current generation data element, which could use a hierarchical bitmap.
U.S. Pat. No. 6,175,900 describes hierarchical bitmap-based memory manager. A hierarchical bitmap-based memory manager maintains a hierarchical bitmap having an entry for each memory block in a memory heap. Each bitmap entry contains a multi-bit value that represents an allocation state of the corresponding memory block. The memory manager manages allocation, deallocation, and reallocation of the memory blocks, and tracks the changes in allocation state via the hierarchical bitmap. Using a two-bit value, the bitmap can represent at most four different allocation states of the corresponding memory block, including a “free” state, a “sub-allocated” state in which the corresponding memory block is itself an allocated set of smaller memory blocks, a “continue” state in which the corresponding memory block is allocated and part of, but not last in, a larger allocation of plural blocks, and a “last” state in which the corresponding memory block is allocated and last in an allocation of one or more memory blocks.
U.S. Pat. No. 6,640,290 describes easily coalesced, sub-allocating, hierarchical, multi-bit bitmap-based memory manager. A hierarchical bitmap-based memory manager maintains a hierarchical bitmap having an entry for each memory block in a memory heap. Each bitmap entry contains a multi-bit value that represents an allocation state of the corresponding memory block. The memory manager manages allocation, deallocation, and reallocation of the memory blocks, and tracks the changes in allocation state via the hierarchical bitmap. Using a-two-bit value, the bitmap can represent at most four different allocation states of the corresponding memory block, including a “free” state, a “sub-allocated” state in which the corresponding memory block is itself an allocated set of smaller memory blocks, a “continue” state in which the corresponding memory block is allocated and part of, but not last in, a larger allocation of plural blocks, and a “last” state in which the corresponding memory block is allocated and last in an allocation of one or more memory blocks.
U.S. Pat. No. 6,874,062 describes system and method for utilizing a hierarchical bitmap structure for locating a set of contiguous ordered search items having a common attribute Application. A system and method is provided for locating, within a set of ordered items, N contiguous items having a desired attribute. The system utilizes a hierarchical bitmap structure.
U.S. Pat. No. 6,886,162 describes high-speed methods for maintaining a summary of thread activity for multiprocessor computer systems A high-speed method for maintaining a summary of thread activity reduces the number of remote-memory operations for an n processor, multiple node computer system from n.sup.2 to (2n−1) operations. The method uses a hierarchical summary of-thread-activity data structure that includes structures such as first and second level bit masks. The first level bit mask is accessible to all nodes and contains a bit per node, the bit indicating whether the corresponding node contains a processor that has not yet passed through a quiescent state.
U.S. Pat. No. 5,504,889 describes a method and system for monitoring attributes of files, such as whether a file has been read, in a computer system includes a file index organized into groups, with at least one group including one or more files. This is done using a hierarchical bitmap structure.
Performance Study of Rollout for Multi Dimensional Clustered Tables in DB2, which appeared in EXPDB 2006 relates to bulk deletes. Efficient Bulk Deletes in Relational Databases, ICDE 2001 describes vertical deletes (in contrast to the conventional horizontal delete).