Field of the Invention
This invention generally relates to database management systems and more specifically to detecting failures during the processing of a distributed database system.
Description of Related Art
The above-identified U.S. Pat. No. 8,224,860 discloses a distributed database management system comprising a network of transactional nodes and archival nodes. Archival nodes act as storage managers for all the data in the database. Each user connects to a transactional node to perform operations on the database by generating queries that are processed at that transactional node. A given transactional node need only contain the data and metadata required to process queries from users connected to that node. This distributed database is defined by an array of atom classes, such as an index class, and atoms, where each atom corresponds to a different instance of a class, such as an index atom for a specific index. Replications or copies of a single atom may reside in multiple nodes, wherein the atom copy in a given node is processed in that node.
In an implementation of such a distributed database, asynchronous messages transfer among the different nodes to maintain the integrity of the database in a consistent and concurrent state. Specifically, each node in the database network has a unique communication path to every other node. When one node generates a message involving a specific atom, that message may be sent to every node that contains a replication of that specific atom. Each node generates these messages independently of other nodes. So at any given instant multiple nodes will contain copies of a given atom, and different nodes may be at various stages of processing that atom. As the operations in different nodes are not synchronized, it is important that the database be in a consistent and concurrent state at all times.
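By way of illustration only, the routing of a message to every node holding a replication of an atom might be sketched as follows. The sketch is not taken from U.S. Pat. No. 8,224,860; the names Network, add_replica, broadcast and inboxes are hypothetical, and per-node message queues stand in for the asynchronous communication paths between nodes.

```python
from collections import defaultdict


class Network:
    """Hypothetical model: tracks which nodes hold a copy of each atom."""

    def __init__(self):
        # atom identifier -> names of nodes holding a replication of that atom
        self.replicas = defaultdict(set)
        # node name -> messages queued for later, independent processing
        self.inboxes = defaultdict(list)

    def add_replica(self, atom_id, node):
        """Record that `node` holds a copy of the atom `atom_id`."""
        self.replicas[atom_id].add(node)

    def broadcast(self, sender, atom_id, payload):
        """Queue a message for every other node holding a copy of the atom.

        Returns the sorted list of recipient node names.
        """
        targets = sorted(self.replicas[atom_id] - {sender})
        for node in targets:
            # Each recipient drains its own queue on its own schedule, so
            # nodes may be at different stages of processing the same atom.
            self.inboxes[node].append((sender, atom_id, payload))
        return targets
```

In this sketch a node that holds no copy of the atom receives nothing, which reflects the statement above that a message involving a specific atom is directed to the nodes containing a replication of that atom.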
A major characteristic of such distributed databases is that all nodes be in communication with each other at all times so the database is completely connected. If a communications break occurs, the database is not considered to be connected. One or more nodes must then be identified and may be removed from the network in an orderly manner. Such identification and removal must consider that any node can fail at any given time and that a communications break can occur between only two nodes or that multiple breaks can occur among several nodes. The identification of a node or nodes for removal must be accomplished in a reliable manner. Moreover, such an identification should enable failure processes to resolve a failure with minimal interruption to users.
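As a minimal sketch of the completely connected condition, assuming each communication path is represented as an unordered pair of node names, the following checks that every pair of nodes has a working path and reports the pairs that do not. The function names missing_links and is_completely_connected are hypothetical. The sketch only detects breaks; it does not decide which node or nodes should be removed.

```python
from itertools import combinations


def missing_links(nodes, links):
    """Return the node pairs that have no working communication path.

    `nodes` is an iterable of node names; `links` is an iterable of
    two-element pairs naming the nodes at each end of a working path.
    """
    # Paths are treated as bidirectional, so ("N1", "N2") == ("N2", "N1").
    up = {frozenset(pair) for pair in links}
    return [
        (a, b)
        for a, b in combinations(sorted(nodes), 2)
        if frozenset((a, b)) not in up
    ]


def is_completely_connected(nodes, links):
    """True only when every node has a path to every other node."""
    return not missing_links(nodes, links)
```

For three nodes N1, N2 and N3, losing only the path between N2 and N3 yields a single missing pair, whereas the failure of a whole node would show up as a missing pair with every other node. Distinguishing these cases reliably is the difficulty described above.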