As is known in the art, multi-version concurrency control (MVCC) is a technique used by databases and storage systems to provide concurrent access to data. With MVCC, each user (e.g., system processes and processes that handle user traffic) sees a snapshot of the data at a particular instant in time. Any changes made by a user will not be seen by other users until the changes are committed. Among other advantages, MVCC provides non-blocking access to a shared resource (e.g., data).
Many storage systems use search trees (e.g., B+ trees) to provide efficient access to stored data. Distributed storage systems (or “clusters”) may manage thousands of search trees, each having a very large number (e.g., millions or even billions) of elements. Large search trees are typically stored on disk or another type of non-volatile memory.
To provide MVCC with search trees, a storage system may treat elements of a search tree as immutable. Under MVCC, a search tree may be updated by storing the new or updated data to unused portions of disk and scheduling a tree update. Because existing tree elements are immutable, a tree update does not modify them in place; instead, it creates at least one new tree element. In the case of a B+ tree, which includes a root node, internal nodes, and leaves, a tree update requires generating a new leaf to store the data, a new root node, and possibly new internal nodes. These new tree elements may be linked with existing tree elements to form a new search tree. Tree updates leave unused tree elements on disk and, thus, storage systems typically include a process for detecting and reclaiming unused tree elements (referred to as “garbage collection”).
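The way an update over immutable tree elements produces a new tree that shares structure with the old one can be illustrated with a minimal in-memory sketch. For brevity it uses a plain binary search tree rather than a B+ tree, and nothing is written to disk. The names `Node`, `insert`, `lookup`, and `live_ids` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a node cannot be modified after creation
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return the root of a new tree; no existing node is modified."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        # Copy this node, pointing at a new left subtree and the old right one.
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    # Same key: a replacement node that reuses both existing subtrees.
    return Node(key, value, root.left, root.right)


def lookup(root, key):
    """Return the value stored under key in the tree rooted at root, or None."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


def live_ids(root):
    """Return the identities of all nodes reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node is not None and id(node) not in seen:
            seen.add(id(node))
            stack.extend((node.left, node.right))
    return seen


v1 = insert(insert(insert(None, 5, "a"), 3, "b"), 8, "c")
v2 = insert(v1, 3, "B")                  # update key 3 in a new tree version

print(lookup(v1, 3), lookup(v2, 3))      # b B  (the old snapshot is unchanged)
print(v2.right is v1.right)              # True (the untouched subtree is shared)
print(len(live_ids(v1) - live_ids(v2)))  # 2    (old root and old leaf)
```

In this sketch, the two elements reachable only from `v1` become unused once no reader holds the old root. They correspond to the unused tree elements that garbage collection must detect and reclaim.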
One technique for garbage collecting search trees under MVCC is reference counting. Here, a counter is maintained for each chunk of storage, indicating the number of referenced (or “live”) tree elements within that chunk. However, implementing reference counting in a distributed storage system is challenging and presents a risk of reclaiming storage capacity that is still in use.
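Per-chunk reference counting can be sketched, under simplifying assumptions, as a table that maps each storage chunk to its count of live tree elements. The sketch below is single-process and in-memory, so it does not model the distributed coordination that makes the technique difficult in practice. The names `ChunkTable`, `add_element`, `release_element`, and `reclaimable` are hypothetical.

```python
class ChunkTable:
    """Tracks, per storage chunk, how many live tree elements it holds."""

    def __init__(self):
        self.live_count = {}  # chunk id -> number of live tree elements

    def add_element(self, chunk_id):
        """Record that a new tree element was written to the chunk."""
        self.live_count[chunk_id] = self.live_count.get(chunk_id, 0) + 1

    def release_element(self, chunk_id):
        """Record that a tree element in the chunk is no longer referenced."""
        if self.live_count.get(chunk_id, 0) == 0:
            # A decrement below zero would mean the counter is out of sync.
            raise ValueError(f"chunk {chunk_id} has no live elements")
        self.live_count[chunk_id] -= 1

    def reclaimable(self):
        """Return ids of chunks whose live-element count has dropped to zero."""
        return sorted(cid for cid, n in self.live_count.items() if n == 0)


table = ChunkTable()
table.add_element(1)      # two tree elements stored in chunk 1
table.add_element(1)
table.add_element(2)      # one tree element stored in chunk 2

table.release_element(1)  # a tree update made both chunk-1 elements unused
table.release_element(1)

print(table.reclaimable())  # [1]
```

If a decrement is applied twice or an increment is lost, a counter can reach zero while the chunk still holds live elements. That failure mode is the risk of reclaiming capacity that is still in use.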
Some existing distributed storage systems use a so-called “stop the world” approach to garbage collection whereby all users block until garbage collection completes. An alternative approach is incremental garbage collection, which does not require stopping other processes.