In a clustered database system, multiple “nodes” may have access to the same on-disk copy of a database. Typically, each node is a computing device with its own local memory and processors, running one or more database server instances. The database server instances on each of the nodes may receive queries to access the data. To process a query in parallel, a query coordinator may divide the work for the query among multiple worker threads running on the nodes in the database system.
A complex query may be executed in multiple stages. During execution of the query, results generated by one stage may be used by a subsequent stage. These results are temporarily stored in temporary tables on a shared disk so that they can be accessed by other nodes.
However, storing the temporary table on the shared disk has significant overhead costs. The temporary segments for each worker must be allocated beforehand, and for queries with small results the allocated size may be much larger than what is required. Furthermore, reading from and writing to a disk is slower than reading from and writing to local volatile memory.
If the query is processed in parallel by a plurality of worker threads, each worker thread may write temporary results to a segment on the shared disk. When all the worker threads have finished writing data to their respective segments, the query coordinator merges the segments. A merge operation is a metadata operation that defines the extents from various temporary segments as belonging to the same temporary segment. Subsequent stages may then read the plurality of extents, now merged into a single temporary segment, as if from a single shared table.
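The write-then-merge flow described above can be sketched as follows. This is a minimal in-memory illustration, not an actual database implementation: the names `Extent`, `Segment`, `write_segment`, `merge_segments`, `scan`, and `run_stage` are all hypothetical, and Python lists stand in for blocks on the shared disk.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Extent:
    """Hypothetical stand-in for a run of blocks on the shared disk."""
    worker_id: int
    rows: list


@dataclass
class Segment:
    """A temporary segment, modeled as metadata listing its extents."""
    extents: list = field(default_factory=list)


def write_segment(worker_id, rows):
    # Each worker writes its partial results into its own segment.
    return Segment(extents=[Extent(worker_id, list(rows))])


def merge_segments(segments):
    # Metadata-only merge: extents are re-linked and no row data is copied.
    # The coordinator runs this loop serially, one segment at a time.
    merged = Segment()
    for segment in segments:
        merged.extents.extend(segment.extents)
    return merged


def scan(segment):
    # A subsequent stage reads the merged segment as if it were one table.
    for extent in segment.extents:
        yield from extent.rows


def run_stage(partitions):
    # Workers populate their segments in parallel; the merge happens
    # only after every worker has finished.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        segments = list(
            pool.map(write_segment, range(len(partitions)), partitions)
        )
    return merge_segments(segments)
```

For example, `run_stage([[1, 2], [3], [4, 5, 6]])` returns one segment holding three extents, and `scan` over that segment yields all six rows without any of them having been copied during the merge.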
Although a merge may involve only metadata, the merge is performed serially by a single process. Thus, even though the rest of an operation may be processed in parallel, the merge may take a long time and cause significant delays. For example, for a small set of results, the worker threads may take only a short amount of time to write to the plurality of segments, but there would then be a delay while the query coordinator merges the segments.
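The effect of the serial merge on elapsed time can be illustrated with a simple cost model. This is a rough sketch under stated assumptions: the functions `stage_elapsed` and `merge_fraction` are hypothetical, and the model assumes that the merge costs a fixed amount per segment, which is a simplification introduced here for illustration only.

```python
def stage_elapsed(write_times, merge_cost_per_segment):
    """Estimate the elapsed time to populate a temporary table.

    write_times: per-worker write durations; workers run in parallel,
        so only the slowest worker contributes to elapsed time.
    merge_cost_per_segment: assumed fixed cost for the coordinator to
        merge one segment; the merge is serial, so the costs add up.
    """
    parallel_write = max(write_times)
    serial_merge = merge_cost_per_segment * len(write_times)
    return parallel_write + serial_merge


def merge_fraction(write_times, merge_cost_per_segment):
    """Fraction of the elapsed time spent in the serial merge."""
    serial_merge = merge_cost_per_segment * len(write_times)
    return serial_merge / stage_elapsed(write_times, merge_cost_per_segment)
```

Under this model, adding workers shortens the write phase at best but lengthens the merge phase, so for small result sets the merge can account for most of the elapsed time.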
For parallel queries, the two costs compound. The worker threads write to the shared disk, which is slower than writing to local volatile memory, and once they have finished writing to the temporary table, the query coordinator must still merge the segments before the table population process is complete.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.