As businesses increasingly depend on data, and as data size continues to increase, the importance of rapid and reliable queries on such data also increases.
Further, data processing has moved beyond the world of monolithic data centers housing large mainframe computers with locally stored data repositories, which are easily managed and protected. Instead, today's data processing is typically spread across numerous, geographically disparate computing systems communicating across multiple networks.
One well-known distributed database example is a NoSQL (Not Only Structured Query Language) database called Cassandra, which is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. In one sense, Cassandra addresses the problem of failures by employing a peer-to-peer distributed system across homogeneous nodes where data is periodically distributed via replication amongst all the nodes in a cluster. Referring now to FIG. 1, a simplified example of the Cassandra architecture can be seen. While oftentimes thought of and referred to as a ring architecture, fundamentally it comprises a cluster of nodes 100 (e.g., Node 1, Node 2 and Node 3, each of which is typically running on a physically separate server computing system) communicating with each other across a network (e.g., Network 110) such as a local area network, a wide area network or the internet.
Referring now to FIG. 2, an exemplary prior art cluster of nodes 200 can be seen. The data in this cluster is distributed across the nodes (labeled Node 1, Node 2, Node 3, Node 4 and Node 5 in this example), which can be visualized as a ring, labeled 201 in the figure. This data distribution is both by range, or partition, of the overall dataset and by replication of the data across multiple nodes in accordance with a replication factor N specifying how many copies of a given data partition are to be replicated to other nodes in the cluster. For example, as can be seen in the figure, the dataset has been partitioned such that partition P1(0,250], which covers data ranging from 0 to 250 in the dataset, is separate from partition P2(250,500], which covers data ranging from 250 to 500 in the dataset, and partition P1 can be found stored in Node 1, Node 2 and Node 3 while partition P2 can be found stored in Node 2, Node 3 and Node 4. It is to be understood that such data partitioning and replication is known in the art.
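By way of illustration only, the range partitioning and replication described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Cassandra's actual placement logic: the function names partition_for and replicas_for are hypothetical, the replication factor of 3 is inferred from the three copies of P1 and P2 described above, and the rule that partition Pk is stored on Node k and the next two nodes around the ring is an assumption chosen to match the example.

```python
from bisect import bisect_left
from typing import List

# Upper bounds of the half-open ranges (0,250], (250,500], (500,750], (750,1000].
UPPER_BOUNDS = [250, 500, 750, 1000]
NUM_NODES = 5            # Node 1 .. Node 5, as in the example cluster
REPLICATION_FACTOR = 3   # assumption: N = 3, inferred from the three copies of P1


def partition_for(key: int) -> int:
    """Return the 1-based partition number whose range (lo, hi] contains key."""
    if not 0 < key <= UPPER_BOUNDS[-1]:
        raise ValueError(f"key {key} is outside (0, {UPPER_BOUNDS[-1]}]")
    # bisect_left finds the first upper bound >= key, so 250 maps to P1, 251 to P2.
    return bisect_left(UPPER_BOUNDS, key) + 1


def replicas_for(partition: int) -> List[int]:
    """Return the 1-based node numbers holding a partition (hypothetical rule)."""
    first = partition - 1  # assumption: partition Pk starts at Node k
    return [(first + i) % NUM_NODES + 1 for i in range(REPLICATION_FACTOR)]


print(partition_for(250), replicas_for(1))  # partition 1 and its three nodes
print(partition_for(251), replicas_for(2))  # partition 2 and its three nodes
```

Under these assumptions, a key of 250 falls in partition P1, held by Node 1, Node 2 and Node 3, while a key of 251 falls in partition P2, held by Node 2, Node 3 and Node 4, and the placement wraps around the ring for the last partition.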
Further, all nodes in Cassandra are peers and a client (i.e., an external facility configured to access a Cassandra node, typically via a Java API (application program interface)) can send a read or write request to any node in the cluster, regardless of whether or not that node actually contains and is responsible for the requested data. There is no concept of a master or slave, and nodes dynamically learn about each other through what is known as a gossip broadcast protocol, where information is simply passed along from one node to another in the cluster rather than going to or through any sort of central or master functionality.
A node that receives a client query (e.g., a read or search operation) is commonly referred to as a coordinator for the client query; it facilitates communication with the other nodes in the cluster responsible for the query (contacting at least n replica nodes to satisfy the client query's consistency level), merges the results, and returns a single client query result from the coordinator node to the client.
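The coordinator role described above can be sketched as a simple scatter-and-merge routine. The following Python is a minimal sketch under stated assumptions rather than Cassandra's implementation: coordinate_query, fake_fetch, the REPLICAS map and the sample rows in DATA are all hypothetical, and the coordinator here contacts only the first listed replica of each partition, which corresponds to the weakest consistency level.

```python
from typing import Callable, Dict, Iterable, List

Row = int  # hypothetical: each result row is just an integer key


def coordinate_query(
    partitions: Iterable[str],
    replicas: Dict[str, List[str]],
    fetch: Callable[[str, str], List[Row]],
) -> List[Row]:
    """Fan a query out to one replica per partition and merge the answers."""
    merged: List[Row] = []
    for partition in partitions:
        node = replicas[partition][0]          # assumption: first listed replica
        merged.extend(fetch(node, partition))  # any exception aborts the whole query
    return sorted(merged)


# Usage with an in-memory stand-in for the network call (illustrative data only).
REPLICAS = {
    "P1": ["Node 1", "Node 2", "Node 3"],
    "P2": ["Node 2", "Node 3", "Node 4"],
}
DATA = {"P1": [10, 200], "P2": [300, 450]}


def fake_fetch(node: str, partition: str) -> List[Row]:
    """Pretend to ask `node` for the rows of `partition`."""
    return list(DATA[partition])


result = coordinate_query(["P2", "P1"], REPLICAS, fake_fetch)
print(result)  # a single merged, sorted result returned to the client
```

Note that in this sketch an exception raised by fetch for any one node propagates out of coordinate_query, so a single unanswered request causes the entire client query to fail.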
For example, if Node 5 receives a client query from a client then Node 5 becomes the coordinator for that particular client query. In handling that client query, coordinator Node 5 identifies, using techniques known in the art, which other nodes contain data partitions relevant to the client query. For example, if the client query is with respect to data ranging from 0 through 1000, then in this example, Node 1 (containing partition P4(750,1000] and partition P1(0,250]), Node 2 (containing partition P1(0,250] and partition P2(250,500]), Node 3 (containing partition P1(0,250], partition P2(250,500], and partition P3(500,750]), Node 4 (containing partition P2(250,500], partition P3(500,750] and partition P4(750,1000]) and Node 5 (containing partition P3(500,750] and partition P4(750,1000]) are all identified. As a result, coordinator Node 5 may send a query request 203 to Node 3 with respect to data partitions P1, P2 and P3. However, should Node 3 fail to answer the query request with a query response, for any of various known reasons, the entire distributed query fails. Assuming a 0.1% chance of failure at any given node, this would produce approximately a 10% client query failure rate for distributed queries that contact 100 nodes.
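The failure-rate figure above can be checked with a short calculation. The sketch below assumes that node failures are independent and that the query fails if any single contacted node fails; the function name query_failure_rate is hypothetical.

```python
def query_failure_rate(node_failure_prob: float, nodes_contacted: int) -> float:
    """Probability that at least one contacted node fails, assuming independence."""
    # The query succeeds only if every contacted node succeeds.
    all_succeed = (1.0 - node_failure_prob) ** nodes_contacted
    return 1.0 - all_succeed


rate = query_failure_rate(0.001, 100)
print(f"{rate:.2%}")  # prints 9.52%
```

Under these assumptions the exact figure is about 9.5%, consistent with the approximately 10% stated above, and the rate grows nearly linearly with the number of nodes contacted while the per-node failure probability remains small.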
Fault tolerance techniques already exist, but they are usually applied to simple, single-record queries. Techniques that exist for fault tolerance over complex, multi-record queries rely on concurrently executing multiple queries against replicas of the same data, which is not optimal in terms of network and computing resource usage.
What is needed, therefore, is a simple query approach that is tolerant of such faults while still providing the benefits of querying data distributed across multiple nodes.