B+-trees are used in a many applications, including but not limited to file systems (FS) and database management systems (DBMS). In relational DBMS's B-trees may serve as a primary storage for the table rows as well as a secondary index for them. In a similar way, document-oriented DBMS's B-trees can serve as a primary storage or secondary index for documents. In DBMS's utilizing other data models, B-trees likewise can serve as primary or secondary data storage.
Here and above we will describe a B+-tree and its modifications storing table rows. However, all the statements about such trees are also true for storing documents or other data items. Data items can have an extractable key, and these keys can be comparable. That is for given two keys A and B, key A is less than, equal to, or greater than B.
While typically used for storing the data on the disk, B+-trees are also frequently used as a main memory data structure. B+-trees use relatively large nodes in comparison with red-black trees or AVL trees. Such node sizes appear to be efficient for managing a large amount of data in main memory, because it provides better data locality.
Here and below we will refer to any persistent data storage device as disk. Persistent storage devices include but not limited to magnetic disk device, solid state drive (SSD), non-volatile random access memory (NVRAM) and network-distributed media.
Considering B+-tree stored on disk, the nodes of a B+-tree can be copied from the disk to the main memory before their contents can be accessed by a computer's processor. If the main memory area allocated to store contents of nodes is already full, then one of the already loaded nodes can be evicted from main memory. If that node has been modified, then it can be written back to disk before being removed from main memory. Typically these routines in DBMS are controlled by separate subsystem called a “buffer manager.”.
Many of DBMS's implement transactions with ACID properties, i.e., Atomicity, Consistency, Isolation, and Durability, which may create restrictions on the DBMS.
The durability requirement means that after a system crash, the DBMS should be able to recover the database once disk media was saved. Recovery needs to bring the database to a consistent state, which restores all transactions confirmed before the crash. Therefore, any transaction needs to reach to the disk media before being confirmed.
Transactions might reach disk media by writing all the data, which it has to modify. However, this approach has number of shortcomings. First, data modified by transactions may rely at random places of disk media. That means the DBMS has to complete a lot of random disk writes, which could cause a serious performance penalty. Second, the DBMS typically divides information into blocks, which are not necessarily being written atomically. So, DBMS crashes may happen in the middle of a block write, making it difficult to recover after such a partial write.
A Common approach to implement durability and evade the above difficulties is the write-ahead log (WAL). Using WAL, writing modified pages can be postponed. For every data modification, the WAL record is updated before the modification is completed. WAL is an append-only journal, which is stored on the disk media.
Even if modifications made to the pages are lost due to a DBMS crash, all necessary information is saved in the WAL. During recovery, lost changes are replayed using the WAL. It's important for WAL records to be idempotent. So that if the modification is already applied to the disk, the corresponding WAL record just does nothing.
WAL records can reach disk, before corresponding changes in the data itself do. In order to provide that, buffer manager is tightly coupled with WAL subsystem. Transactions can be confirmed after all corresponding WAL records are written. During recovery, DBMS can replay WAL records bringing the database to the state that has all confirmed transactions.
Transaction isolation requirements mean that concurrent execution of the transactions produces the same effect as some serial execution of those same transactions. Since strict following of this isolation definition could be costly, SQL standards define more relaxed isolation levels. Transaction isolation could be implemented using either optimistic or pessimistic locking when a reader and a writer of data could block each other. Another approach is multiversion concurrency control (MVCC) when a writer makes a new version of a modified data item while a reader could see an older version of the same data item without being blocked. MVCC allows transactions to see immutable snapshots of the database.
There are various ways to implement MVCC. For instance, a physical copy of a row could be marked with its creator and destructor transaction identifiers. Also, old versions of data items could be placed into a separate data structure called an “undo log”, which could be either block-level or row-level or mixed. Primary and secondary data structures can support snapshots independently. Alternatively, secondary data structures can contain links to multiple possible versions of data, relying on a visibility check while fetching from primary data structure.