As businesses increasingly depend on data, and as data volumes continue to grow, the importance of data integrity, i.e., the accuracy and consistency of data over time, also grows.
Further, data processing has moved beyond the world of monolithic data centers housing large mainframe computers with locally stored data repositories, an arrangement that is easily managed and protected. Instead, today's data processing is typically spread across numerous, geographically disparate computing systems communicating across multiple networks.
One well-known distributed database example is a NoSQL (Not Only Structured Query Language) database called Cassandra, which is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. In one sense, Cassandra addresses the problem of failures by employing a peer-to-peer distributed system across homogeneous nodes where data is distributed via replication amongst all the nodes in a cluster. Referring now to FIG. 1, a simplified example of the Cassandra architecture can be seen. While oftentimes thought of and referred to as a ring architecture, fundamentally it comprises a cluster of nodes 100 (e.g., Node 1, Node 2 and Node 3, each of which is typically running on a physically separate server computing system) communicating with each other across a network (e.g., Network 110) such as a local area network, a wide area network or the internet.
Referring now to FIG. 2, an exemplary prior art cluster of nodes 200 can be seen. The data in this cluster is distributed across the nodes (labeled Node 1, Node 2, Node 3, Node 4 and Node 5 in this example) which can be visualized as a ring, labeled 201 in the figure. This data distribution is both by range or partition of the overall dataset as well as by replication of the data across multiple nodes in accordance with a replication factor N specifying how many copies of a given data partition are to be replicated to other nodes in the cluster. For example, as can be seen in the figure, the dataset has been partitioned such that partition P1(0,250], which covers data ranging from 0 to 250 in the dataset, is separate from partition P2(250,500], which covers data ranging from 250 to 500 in the dataset, and partition P1 can be found stored in Node 1, Node 2 and Node 3 while partition P2 can be found stored in Node 2, Node 3 and Node 4. It is to be understood that such data partitioning and replication across a cluster of nodes is known in the art.
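By way of illustration only, the partitioning and replication just described can be sketched as follows. The sketch assumes a replication factor of three total copies and assumes that partition Pk is placed on Node k and then on the next nodes around the ring; these assumptions were chosen because they reproduce the placement given for FIG. 2, and the names partition_for and replicas_for are hypothetical helpers rather than part of any Cassandra interface.

```python
from typing import Dict, List, Tuple

# Partition ranges from the FIG. 2 example; each covers (low, high].
PARTITIONS: Dict[str, Tuple[int, int]] = {
    "P1": (0, 250),
    "P2": (250, 500),
    "P3": (500, 750),
    "P4": (750, 1000),
}
NODES: List[str] = ["Node 1", "Node 2", "Node 3", "Node 4", "Node 5"]

# Assumption: N = 3 total copies of each partition, as in the example.
REPLICATION_FACTOR = 3


def partition_for(value: int) -> str:
    """Return the partition whose half-open range (low, high] contains value."""
    for name, (low, high) in PARTITIONS.items():
        if low < value <= high:
            return name
    raise ValueError(f"{value} is outside every partition range")


def replicas_for(partition: str) -> List[str]:
    """Return the nodes holding a partition.

    Assumption: partition Pk starts at Node k and continues around the ring,
    wrapping from the last node back to the first.
    """
    start = list(PARTITIONS).index(partition)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


if __name__ == "__main__":
    for name in PARTITIONS:
        print(name, "->", replicas_for(name))
```

Under these assumptions the placement wraps around the ring, so that partition P4 is held by Node 4, Node 5 and Node 1.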
Further, all nodes in Cassandra are peers, and a client (i.e., an external facility configured to access a Cassandra node, typically via a Java API (application program interface), and sometimes referred to as a user) can send a read or write request to any node in the cluster, regardless of whether or not that node actually contains and is responsible for the requested data. There is no concept of a master or slave, and nodes dynamically learn about each other through what is known as a gossip broadcast protocol, in which information is simply passed along from one node to another in the cluster rather than going to or through any sort of central or master functionality.
A node that receives a client query (e.g., a read or search operation) is commonly referred to as a coordinator for the client query; it facilitates communication with the other nodes in the cluster responsible for the query (contacting one or more replica nodes to satisfy the client query's consistency level), merges the results, and returns a single client query result from the coordinator node to the client.
For example, if Node 5 receives a client query from a client, then Node 5 becomes the coordinator for that particular client query. In handling that client query, coordinator Node 5 identifies, using techniques known in the art, which other nodes contain data partitions relevant to the client query. For example, if the client query is a read operation with respect to data ranging from 0 through 1000, then in this example, Node 1 (containing partition P4(750,1000] and partition P1(0,250]), Node 2 (containing partition P1(0,250] and partition P2(250,500]), Node 3 (containing partition P1(0,250], partition P2(250,500], and partition P3(500,750]), Node 4 (containing partition P2(250,500], partition P3(500,750] and partition P4(750,1000]) and Node 5 (containing partition P3(500,750] and partition P4(750,1000]) are all identified. As a result, coordinator Node 5 may send a query request 203 to Node 3 with respect to data partitions P1, P2 and P3.
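The node identification step in this example can be sketched, under stated assumptions, as follows. The node-to-partition layout is copied from the FIG. 2 example above, while the function name nodes_for_query and the treatment of a query as a closed range of values are hypothetical choices made only for illustration; the sketch does not model consistency levels or the merging of results.

```python
from typing import Dict, List, Set, Tuple

# Partition ranges from the FIG. 2 example; each covers (low, high].
RANGES: Dict[str, Tuple[int, int]] = {
    "P1": (0, 250),
    "P2": (250, 500),
    "P3": (500, 750),
    "P4": (750, 1000),
}

# Which partitions each node stores, as listed in the example.
NODE_PARTITIONS: Dict[str, Set[str]] = {
    "Node 1": {"P4", "P1"},
    "Node 2": {"P1", "P2"},
    "Node 3": {"P1", "P2", "P3"},
    "Node 4": {"P2", "P3", "P4"},
    "Node 5": {"P3", "P4"},
}


def nodes_for_query(low: int, high: int) -> Dict[str, List[str]]:
    """Map each node to the partitions it holds that overlap [low, high].

    Nodes holding no relevant partition are omitted from the result.
    """
    # A partition (p_low, p_high] overlaps the query when the ranges intersect.
    needed = {
        name
        for name, (p_low, p_high) in RANGES.items()
        if p_low < high and p_high >= low
    }
    plan: Dict[str, List[str]] = {}
    for node, held in NODE_PARTITIONS.items():
        relevant = sorted(held & needed)
        if relevant:
            plan[node] = relevant
    return plan


if __name__ == "__main__":
    plan = nodes_for_query(0, 1000)
    # Coordinator Node 5 could, for instance, ask Node 3 for these partitions.
    print("Node 3 can serve:", plan["Node 3"])
```

For a read over 0 through 1000, every node appears in the result, and Node 3 is associated with partitions P1, P2 and P3, which corresponds to query request 203 in the example.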
Write operations on a given node, while similar in some respects to the above-described read operation client queries, are handled somewhat differently. Within each node, e.g., Nodes 1-5 of FIG. 2 but referring now to FIG. 3, a sequentially written disk-based commit log 309 captures write activity by that node to ensure data durability. Data is then indexed and written to an in-memory (i.e., working memory 305) structure, called a memory table or a memtable 303, which resembles a write-back cache. Once the memory structure is full, in what is called a flush operation, the data is written from the memtable 303 in working memory 305 to long term storage (denoted “disk 307” although it may be a solid state device such as flash memory) in what is known as a Sorted String Table (SSTable) type data file 311. Once the data has been written to a data file 311 on disk 307, the commit log 309 is deleted from the disk 307. As is known in the art, these SSTable data files 311 are immutable in that updates and changes are made via new memtable entries which create new SSTable data files rather than overwriting already existing SSTable data files. A process called compaction periodically consolidates SSTables to discard old and obsolete data.
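The write path described above can be illustrated with a minimal in-memory sketch. It is a toy model under stated assumptions and not Cassandra's implementation: the class name ToyNode and its methods are hypothetical, Python lists stand in for the on-disk commit log and SSTable files, and the memtable is treated as full after a fixed number of keys.

```python
from typing import Dict, List, Optional, Tuple

# An SSTable is modeled as an immutable tuple of (key, value) pairs sorted by key.
SSTable = Tuple[Tuple[str, str], ...]


class ToyNode:
    """Toy model of one node's write path (all names are hypothetical)."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable_limit = memtable_limit  # assumed "full" threshold
        self.commit_log: List[Tuple[str, str]] = []  # stands in for the disk log
        self.memtable: Dict[str, str] = {}
        self.sstables: List[SSTable] = []  # oldest first

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # record the write for durability
        self.memtable[key] = value  # then apply it in working memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as a new SSTable, then drop the commit log."""
        if not self.memtable:
            return
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()

    def read(self, key: str) -> Optional[str]:
        """Return the newest value for key, checking the memtable first."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None

    def compact(self) -> None:
        """Merge all SSTables into one, keeping only the newest value per key."""
        if not self.sstables:
            return
        merged: Dict[str, str] = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    for key, value in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]:
        node.write(key, value)
    node.compact()
    print(node.sstables)
```

In this sketch an update to an existing key never modifies an earlier SSTable. The newer value is written to a later SSTable, and the obsolete value is discarded only when compact merges the tables.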
As stated above, data is distributed via replication amongst all the nodes in the cluster. Such replication ensures there is more than one copy of a given piece of data and is thus an attempt at maintaining data integrity. However, mere replication alone does not guarantee data integrity across the various nodes in the cluster. For example, latency in communicating data between nodes can cause data in one node to differ from replica data in another node, otherwise known as a lack of data consistency between the nodes. As another example, data loss caused by some storage medium failure or data corruption can also cause a lack of data consistency between nodes. For these and other reasons, there is a need for an improved approach to maintaining data consistency across replicas in a cluster of nodes.