The present disclosure relates to managing data obsolescence in relational databases.
Databases are used to store information for many applications, including various commercial, industrial, technical, scientific and educational applications. The volume of data being stored and processed in databases continues to increase seemingly inexorably. However, the ability of database systems to store data and to process queries is inevitably limited. Relational databases organize data into formally-defined tables of tuples or records. The tables are typically accessed using a computer language such as Structured Query Language (SQL) in which the tables consist of rows for the records and columns for the data values.
A database management system (DBMS) is the entity responsible for processing database queries to access the relevant data stored in a database. How best to process database queries, e.g. in terms of speed and consumption of internal processing resource, is referred to as query optimization.
One significant performance bottleneck when processing database queries is the input/output (I/O) activity associated with retrieving data from memory. An efficient way of reducing query I/O in a relational database is to use bitmap (or bit array) indexing. A bitmap consists of a vector of 1-bit elements in which each element corresponds to a row of a table. Bitmaps are particularly useful in searching table columns over which an index has been built for a particular selection criterion or criteria. To execute a query, the DBMS scans the index and creates a bitmap by setting each bitmap element to either a ‘1’ or a ‘0’ depending on whether the value in the corresponding row of that column satisfies the selection criteria. A ‘1’ typically indicates that the selection criteria is satisfied while a ‘0’ indicates otherwise. (The reverse is also possible.) By using a bitmap, the database engine can forego testing database records having ‘0’ entries and, moreover, avoid retrieving these records at all. As a result, the amount of I/O needed to process a database query can be significantly reduced.
Bitmaps are useful for notification applications in which repeated, often similar or identical, queries are applied to a notification application database, which is continuously being added to with new records. If the database table becomes too large, the queries are slowed and the increasing size of the database consumes increasing amounts of memory and mass storage. Here it is known to remove obsolete data from the database based on a retention age limit in a process referred to as vacuuming.
One factor that links query performance to underlying hardware is that data storage in a computer system will be tiered, with a hierarchy between the fastest access, lowest volume memory local to the processors and the slowest access, largest volume memory and mass storage devices which have to be accessed over a system bus or network connection. Memory access speed lags are usually referred to as latency. Memory access times may be tiered following a basic split between volatile memory (e.g. RAM) with faster access times and non-volatile memory (e.g. ROM) with slower access times. There may also be significant splits within those groups, e.g. for the volatile memory on a particular motherboard there may be a split between on-processor memory, one or more types of cache memory and general memory. At any one time, different parts of a database will therefore likely be stored at different hierarchical storage levels on the host computer system with a range of latencies. Sometimes, in-demand data is referred to as ‘hot’ and data which is infrequently accessed ‘cold’, leading to the term multi-temperature storage (MTS) being used.
The perfect (and unattainable) situation would of course be that all records that will need to be searched when processing the next query are already stored in the fastest access tier of memory. The more practical aim is to make sure that records which, based on past activity, are unlikely to be the subject of queries in the near future are relegated to slow access memory tiers, whereas those which are predicted to be the subject of queries in the near future are resident in fast access memory tiers. For many types of application, it is unacceptable or undesirable to permanently delete obsolete records with a vacuuming process; rather all records should be retained, but with the obsolete ones preferably being moved to archive, e.g. high latency memory or mass storage.
In the following, we refer to data relating to records which are deemed to be unlikely to be the subject of queries in the near future as obsolescent data. What is described in the following is a method of managing the obsolescence of data in a database.