Different data formats have different benefits. Therefore, techniques have been developed for maintaining data persistently in one format, but making that data available to a database server in more than one format. For example, in a dual-format database system, one of the formats in which the data is made available for query processing is based on the on-disk format, while another is independent of the on-disk format.
The format that corresponds to the on-disk format is referred to herein as the “persistent format” or “PF”. Data that is in the persistent format is referred to herein as PF data. An in-memory format that is independent of the on-disk format is referred to as a “mirror format” or “MF”. Data that is in the mirror format is referred to herein as MF data. For example, in one embodiment, the persistent format is row-major disk blocks, and the mirror format is a column-major format. Such a dual-format database system is described in U.S. patent application Ser. No. 14/337,179, entitled MIRRORING, IN MEMORY, DATA FROM DISK TO IMPROVE QUERY PERFORMANCE (hereinafter the “Mirroring Application”), the contents of which are incorporated herein by this reference.
As explained in the Mirroring Application, the mirror format is completely independent of the persistent format. However, the MF data is initially constructed in memory based on the persistently stored PF data, not based on any persistent MF structures. Since persistent MF structures are not required, users of existing databases need not migrate the data or structures in their existing databases to another format. Thus, a conventional database system that uses row-major disk blocks may continue to use those disk blocks to persistently store its data without performing any data migration, while still obtaining the performance benefit that results from having a column-major representation of the data available in volatile memory.
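By way of illustration only, the following minimal sketch shows how column-major MF data might be constructed in memory from row-major PF data. The function name `build_column_vectors` and the sample rows are hypothetical and do not come from the Mirroring Application; a real system would read rows from disk blocks rather than from a Python list.

```python
from typing import Any


def build_column_vectors(
    rows: list[tuple[Any, ...]], column_names: list[str]
) -> dict[str, list[Any]]:
    """Pivot row-major tuples into one in-memory vector per column."""
    columns: dict[str, list[Any]] = {name: [] for name in column_names}
    for row in rows:
        # Each row contributes exactly one value to each column vector.
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


# Hypothetical PF rows, standing in for the contents of row-major disk blocks.
pf_rows = [(1, "alice", 100), (2, "bob", 200), (3, "carol", 300)]

# The MF copy is derived from the PF rows; the PF rows themselves are unchanged.
mf_columns = build_column_vectors(pf_rows, ["c1", "c2", "c3"])
```

Because the column vectors are derived entirely from the existing rows, the sketch requires no change to the persistent representation, which mirrors the no-migration property described above.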
In some embodiments, the MF data is compressed. The compression can be performed at various compression levels, either specified by the user or based on access patterns. In an embodiment in which the MF data is compressed, the MF data may be organized, within volatile memory 102, into “in-memory compression units” (IMCUs).
In one embodiment, when a data item is updated, the copy of the data item in the PF data is updated, but the copy of the data item in the MF data is not. Specifically, the data item copies that are in IMCUs are not updated in response to updates because the overhead involved in decompressing the IMCU, updating the contents thereof, and then recompressing the IMCU could significantly reduce system performance. Instead, those data items are marked as “invalid” within the IMCU, and the updates to the data items are stored outside the IMCU. Consequently, as more and more updates are made to data items contained in an IMCU, the IMCU becomes increasingly stale. The more stale an IMCU, the less efficient it is to use the IMCU, because the current values of the invalid items need to be obtained from another source, such as a journal, the buffer cache, or the on-disk PF data.
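The invalidation behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any actual database server: the class `Imcu`, the function `update`, the dictionary used as a journal, and the use of `zlib` over JSON as the compression scheme are all hypothetical.

```python
import json
import zlib


class Imcu:
    """Hypothetical compressed unit holding the values of a single column."""

    def __init__(self, values):
        # The values are compressed once and never modified afterward.
        self._blob = zlib.compress(json.dumps(values).encode("utf-8"))
        self.row_count = len(values)
        self.invalid = set()  # row positions whose compressed value is stale

    def mark_invalid(self, row_idx):
        self.invalid.add(row_idx)

    def staleness(self):
        """Fraction of rows whose current value lives outside the unit."""
        return len(self.invalid) / self.row_count if self.row_count else 0.0

    def scan(self, journal):
        """Return current values, consulting the journal for invalid rows."""
        values = json.loads(zlib.decompress(self._blob).decode("utf-8"))
        return [
            journal[i] if i in self.invalid else v
            for i, v in enumerate(values)
        ]


def update(imcu, journal, row_idx, new_value):
    """Record an update outside the unit and mark the old copy invalid."""
    journal[row_idx] = new_value
    imcu.mark_invalid(row_idx)


imcu = Imcu([100, 200, 300, 400])
blob_before = imcu._blob
journal = {}  # hypothetical stand-in for a journal of recent changes
update(imcu, journal, 1, 250)
current = imcu.scan(journal)
```

In this sketch, every invalid row forces a lookup outside the compressed unit during a scan, which is why a higher value of `staleness()` corresponds to a less efficient unit.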
Rather than let IMCUs become so stale that they no longer improve database performance, the IMCUs can periodically be “repopulated”. Repopulating an IMCU, which is also referred to as “refreshing” or “merging”, involves reconstructing the IMCU with more current data. Thus, to repopulate an IMCU that contains columns c1, c2 and c3 for table emp, the database server would have to obtain all of the current values from c1, c2 and c3, organize those values in column vectors, compress the column vectors, and then package the compressed column vectors into an IMCU. Various techniques for repopulating an IMCU are described in detail in U.S. patent application Ser. No. 14/337,045, entitled GRANULAR CREATION AND REFRESH OF COLUMNAR DATA, the contents of which are incorporated herein by reference.
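The repopulation steps described above (obtain current values, organize them into column vectors, compress the vectors, and package them) can be sketched as follows. The names `repopulate` and `read_column`, the dictionary layout of the resulting unit, and the use of `zlib` over JSON are assumptions made for illustration; they are not taken from the application referenced above.

```python
import json
import zlib


def repopulate(current_rows, column_names):
    """Rebuild a hypothetical IMCU from the current values of each row."""
    # Step 1: organize the current values into one vector per column.
    vectors = {name: [] for name in column_names}
    for row in current_rows:
        for name in column_names:
            vectors[name].append(row[name])
    # Step 2: compress each column vector independently.
    compressed = {
        name: zlib.compress(json.dumps(vec).encode("utf-8"))
        for name, vec in vectors.items()
    }
    # Step 3: package the compressed vectors; a fresh unit has no invalid rows.
    return {
        "columns": compressed,
        "row_count": len(current_rows),
        "invalid": set(),
    }


def read_column(imcu, name):
    """Decompress one column vector from the packaged unit."""
    return json.loads(zlib.decompress(imcu["columns"][name]).decode("utf-8"))


# Hypothetical current rows of table emp, after all updates are applied.
emp_rows = [
    {"c1": 1, "c2": "alice", "c3": 100},
    {"c1": 2, "c2": "bob", "c3": 250},
]
fresh = repopulate(emp_rows, ["c1", "c2", "c3"])
```

Even in this small sketch, repopulation touches every value of every column and allocates a complete new unit, which suggests why the operation is costly in both CPU usage and memory consumption.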
Unfortunately, repopulating IMCUs incurs a significant amount of overhead, both in terms of CPU usage and memory consumption. Consequently, a repopulation strategy that attempts to keep all IMCUs as fresh as possible is likely to incur an excessive amount of overhead, leading to performance reduction rather than performance improvement. On the other hand, a repopulation strategy that allows IMCUs to become and remain largely stale for long periods of time would significantly reduce the performance benefit of having IMCUs in the first place.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.