Field of the Invention
Embodiments of the invention generally relate to eventually-consistent data stores. More specifically, embodiments of the invention relate to systems and methods for building a point-in-time snapshot of an eventually-consistent data store.
Description of the Related Art
Companies involved in e-commerce typically maintain one or more datacenters to provide the resources needed to handle customers' needs on the Internet. A datacenter may consist of hundreds or thousands of server computers in a single building, along with high-speed communication lines that connect those servers to the Internet. The servers may also be connected to large data stores that consist of thousands of disk drives or other non-volatile storage.
Lately, a “cloud” computing model has enabled companies to purchase computing resources on an as-needed basis from providers such as Amazon®. Cloud computing is the delivery of computing resources as a service over a network such as the Internet. Instead of maintaining a datacenter at a facility that the company owns, the company can “lease” use of a virtual datacenter provided by a third-party provider. The provider maintains the hardware at various locations throughout the world, and the company can lease and scale that hardware to match the company's needs at any given time.
One aspect of cloud services is cloud storage, where the provider leases virtual storage space to various companies or individuals. For example, Amazon® Web Services (AWS) includes Amazon® Simple Storage Service (S3), which enables a user to store objects (e.g., videos, documents, etc.) at datacenters around the world using a web interface. The user can choose the geographic region in which an object is stored and can choose an amount of redundancy (i.e., by storing the object at multiple different datacenters) that ensures object availability even if one datacenter goes offline.
An eventually-consistent data store is a data store that sacrifices consistency for availability and partition tolerance. In other words, a system may store data redundantly in multiple locations in order to ensure that the data is available despite communication failures between nodes (partition tolerance); however, the system cannot then also ensure that the data is consistent across the multiple nodes at any given time. Eventually-consistent data stores ensure that requests for data are serviced quickly while not ensuring that the data is consistent across every node where that data may be stored.
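By way of illustration only, the behavior described above may be sketched with a toy in-memory model. The names `ToyStore`, `write`, `read`, and `propagate` are hypothetical and do not correspond to any particular data store; the sketch assumes that each write carries a timestamp and that replicas resolve conflicts by keeping the most recent write.

```python
from typing import Dict, List, Optional, Tuple

# (value, timestamp) pair held by a replica for one key
Entry = Tuple[str, int]


class ToyStore:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count: int) -> None:
        self.replicas: List[Dict[str, Entry]] = [{} for _ in range(replica_count)]
        # Writes not yet delivered: (target replica, key, value, timestamp)
        self.pending: List[Tuple[int, str, str, int]] = []

    def _apply(self, replica: int, key: str, value: str, ts: int) -> None:
        current = self.replicas[replica].get(key)
        # Assumed conflict rule: keep the write with the larger timestamp.
        if current is None or ts > current[1]:
            self.replicas[replica][key] = (value, ts)

    def write(self, key: str, value: str, ts: int, coordinator: int = 0) -> None:
        # The coordinator acknowledges immediately (availability) ...
        self._apply(coordinator, key, value, ts)
        # ... while the other replicas are updated asynchronously.
        for i in range(len(self.replicas)):
            if i != coordinator:
                self.pending.append((i, key, value, ts))

    def read(self, key: str, replica: int) -> Optional[str]:
        entry = self.replicas[replica].get(key)
        return entry[0] if entry is not None else None

    def propagate(self) -> None:
        # Deliver every queued write; afterwards the replicas agree.
        for replica, key, value, ts in self.pending:
            self._apply(replica, key, value, ts)
        self.pending.clear()
```

In this sketch, a read directed at a replica other than the coordinator may return a stale or missing value until `propagate` has run, which is the inconsistency window that the paragraph above describes.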
In order to retrieve a consistent snapshot of data from the distributed data store, an administrator must either force a consistent read across all nodes (essentially preventing any requests from being processed by the system during this time) or read separately from the various nodes and reconcile the data at a later time. The former imposes a heavy load on the data store and, in some cases, may be impossible to perform given the distributed nature of the data store. The latter requires that additional services be implemented in the data store to generate a snapshot of the state of each individual node, along with the ability to reconcile the data from every node at a later point in time.
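The second approach, reading from each node separately and reconciling afterward, may be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular system: the `Record` type, the `reconcile` function, and the `as_of` parameter are hypothetical, each node is assumed to export timestamped records, and conflicts are assumed to be resolved by keeping the latest write at or before the requested point in time.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Record:
    key: str
    value: Optional[str]  # None marks a deletion (assumed tombstone convention)
    timestamp: int        # assumed write time in arbitrary, comparable units


def reconcile(node_exports: Iterable[Iterable[Record]], as_of: int) -> Dict[str, str]:
    """Merge per-node exports into one view of the data as of `as_of`."""
    latest: Dict[str, Record] = {}
    for export in node_exports:
        for rec in export:
            if rec.timestamp > as_of:
                continue  # written after the requested point in time
            current = latest.get(rec.key)
            if current is None or rec.timestamp > current.timestamp:
                latest[rec.key] = rec
    # Drop keys whose most recent record is a deletion.
    return {k: r.value for k, r in latest.items() if r.value is not None}


# Example exports from two hypothetical nodes that have diverged.
node_a = [Record("user:1", "alice", 10), Record("user:2", "bob", 12)]
node_b = [
    Record("user:1", "alicia", 15),
    Record("user:2", None, 20),
    Record("user:3", "carol", 30),
]
snapshot = reconcile([node_a, node_b], as_of=18)
```

With `as_of=18`, the sketch keeps the later value for `user:1`, ignores the deletion of `user:2` because it occurred after the cutoff, and omits `user:3` entirely. The sketch also makes the cost noted above visible: every node must be able to export timestamped records, and a separate reconciliation pass must process all of them.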
Improved techniques are needed to provide data analysts with a snapshot of the eventually-consistent data store at a particular point in time without interfering with normal operation of the data store.