Distribution of various components of a software stack can in some cases provide (or support) fault tolerance (e.g., through replication), higher durability, and less expensive solutions (e.g., through the use of many smaller, less-expensive components rather than fewer large, expensive components). However, databases have historically been among the components of the software stack that are least amenable to distribution. For example, it can be difficult to distribute databases while still ensuring the so-called ACID properties (i.e., Atomicity, Consistency, Isolation, and Durability) that they are expected to provide.
While most existing relational databases are not distributed, some existing databases are “scaled out” (as opposed to being “scaled up” by merely employing a larger monolithic system) using one of two common models: a “shared nothing” model, and a “shared disk” model. In general, in a “shared nothing” model, received queries are decomposed into database shards (each of which includes a component of the query), these shards are sent to different compute nodes for query processing, and the results are collected and aggregated before they are returned. In general, in a “shared disk” model, every compute node in a cluster has access to the same underlying data. In systems that employ this model, great care must be taken to manage cache coherency. In both of these models, a large, monolithic database is replicated on multiple nodes (including all of the functionality of a stand-alone database instance), and “glue” logic is added to stitch them together. For example, in the “shared nothing” model, the glue logic may provide the functionality of a dispatcher that subdivides queries, sends them to multiple compute nodes, and then combines the results. In a “shared disk” model, the glue logic may serve to fuse together the caches of multiple nodes (e.g., to manage coherency at the caching layer). These “shared nothing” and “shared disk” database systems can be costly to deploy and complex to maintain, and may over-serve many database use cases.
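The dispatcher role described for the “shared nothing” model can be illustrated with a minimal sketch. It is not drawn from any particular database system: the shard contents, the function names `run_on_node` and `dispatch`, and the use of a thread pool to stand in for separate compute nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "compute node" owns a disjoint shard of (name, amount) rows.
SHARDS = [
    [("alice", 30), ("bob", 12)],
    [("carol", 45)],
    [("dave", 7), ("erin", 21)],
]


def run_on_node(shard, min_amount):
    """Per-node work: filter the local rows and return a partial (sum, count)."""
    matching = [amount for _, amount in shard if amount >= min_amount]
    return sum(matching), len(matching)


def dispatch(min_amount):
    """Glue logic: send the sub-query to every node, then combine the partial results."""
    # Threads stand in for remote compute nodes in this sketch.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda shard: run_on_node(shard, min_amount), SHARDS))
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total, count
```

In this sketch each node sees only its own shard, and only the dispatcher sees the combined result. Sum and count can be merged by simple addition; other aggregates may require more elaborate combination logic in the dispatcher.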
In traditional database systems, the data managed by a database system is stored on direct-attached disks. If a disk fails, it is replaced and then must be reloaded with the appropriate data. For example, in many systems, crash recovery includes restoring the most recent snapshot from a backup system and then replaying any changes made since that snapshot was taken. However, this approach does not scale well to large databases.
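The snapshot-and-replay recovery described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the recovery procedure of any specific system: the key-value state, the log record layout `(lsn, op, key, value)`, the log sequence number `snapshot_lsn`, and the function name `recover` are all hypothetical.

```python
def recover(snapshot, snapshot_lsn, log):
    """Rebuild state from a snapshot plus the log records written after it."""
    state = dict(snapshot)  # copy so the backup itself is not mutated
    for lsn, op, key, value in log:
        if lsn <= snapshot_lsn:
            continue  # this change is already reflected in the snapshot
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


# Hypothetical backup taken after log record 2, followed by three later changes.
snapshot = {"a": 1, "b": 2}
log = [
    (1, "put", "a", 1),
    (2, "put", "b", 2),
    (3, "put", "a", 10),
    (4, "delete", "b", None),
    (5, "put", "c", 3),
]
recovered = recover(snapshot, snapshot_lsn=2, log=log)
```

In this sketch the work done during recovery is the cost of copying the entire snapshot plus one step per log record, so both a larger database and a longer interval since the last snapshot lengthen recovery.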
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.