This invention relates generally to fault management in distributed database systems, and more particularly to management of resets in mirrored database segments.
In distributed database systems, data is replicated (mirrored) across sets of nodes holding different database segments to provide fault tolerance. This requires at least two replicated segments: a primary segment and a mirror segment. If the primary segment becomes unusable, a mirror segment can be promoted to primary to keep the system online. A fault tolerance service (FTS) maintains health information for each segment, and uses the information to decide whether a mirror should be promoted to primary. FTS can be centralized to run on a master node or distributed to two or more segment nodes using a consensus protocol. FTS periodically checks the health of each primary and mirror in a primary-mirror pair by probing the nodes. If one segment has a problem and the segments are synchronized, FTS transitions the healthy segment to become primary and to enter a low-availability mode, while the faulty segment is marked as mirror and unavailable. For the period that the mirror is unavailable, the primary keeps track of any updates to the stored data. If a failed mirror is recovered, it is re-synchronized by receiving and applying the pending updates from the primary. Until re-synchronization completes, the data stored in the mirror are not consistent with the data on the primary, so the mirror cannot be used for failover.
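The failover rule described above can be illustrated with a minimal sketch. All names here (Segment, SegmentPair, evaluate_pair) are hypothetical and chosen for illustration only; the sketch assumes the probe results are already known and does not represent any particular FTS implementation. Cases the text does not address, such as both segments being faulty or the pair being unsynchronized, are left unchanged.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    MIRROR = "mirror"


@dataclass
class Segment:
    name: str                # hypothetical identifier
    role: Role
    healthy: bool = True     # outcome of the most recent FTS probe
    available: bool = True   # cleared when FTS marks the segment unavailable


@dataclass
class SegmentPair:
    primary: Segment
    mirror: Segment
    synchronized: bool = True
    low_availability: bool = False


def evaluate_pair(pair: SegmentPair) -> SegmentPair:
    """Apply the failover rule to one primary-mirror pair."""
    p, m = pair.primary, pair.mirror
    # Act only when exactly one segment is faulty and the pair is in sync.
    if p.healthy == m.healthy or not pair.synchronized:
        return pair
    healthy, faulty = (p, m) if p.healthy else (m, p)
    healthy.role = Role.PRIMARY
    faulty.role = Role.MIRROR
    faulty.available = False
    pair.primary, pair.mirror = healthy, faulty
    # Assumption: from here on the primary accumulates pending updates,
    # so the pair is no longer synchronized until the mirror recovers.
    pair.synchronized = False
    pair.low_availability = True
    return pair
```

For example, if the segment currently acting as primary fails its probe while the pair is synchronized, evaluate_pair swaps the roles, marks the failed segment unavailable, and places the pair in low-availability mode.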
Each pair of primary and mirror segments is synchronized using a replication protocol. Since the primary and mirror are physically located on different machines, they monitor the states of their communications and replication infrastructure, and report to FTS when probed. For example, if a mirror encounters a failure while trying to receive data from its primary, it will report this event to FTS. If FTS determines that the primary is offline, it promotes the mirror to primary.
Certain software failures may be severe enough to require resetting a single component (e.g., a process) of the database system, a group of processes, or even the operating system by restarting the server machine, and distributed database systems have such reset mechanisms. For instance, if a process crashes while holding a lock, the lock is never released, so one or more processes will likely deadlock waiting for it. Also, a process that detects corruption in shared memory, e.g., due to hardware failures or software bugs, must prevent other processes from transferring corrupted data to the disk and overwriting healthy data. It is, therefore, important that the system have a mechanism to reset, i.e., to immediately stop all running server processes and threads, re-initialize shared memory, and restart all required processes. Many server applications have such a reset mechanism. For instance, distributed database systems that are built on PostgreSQL have a reset mechanism known as “postmaster reset” for this purpose.
Because replication is a “stateful” protocol (the primary and mirror keep track of their ability to communicate and replicate data), if one node detects a replication problem, it may report the problem to FTS and request that action be taken to keep the system operating. On a primary or mirror reset, replication processes are restarted, so the replication framework may also need to be reset. A reset stops all processes, and cleans and re-initializes the shared memory that stores information about the current replication state.
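The effect of a reset on replication state can be illustrated with a toy model. The class SegmentServer, the process names, and the state values below are hypothetical; they stand in for real server processes and shared memory and are not taken from PostgreSQL or any other system.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentServer:
    # Hypothetical process names, used only for illustration.
    required_processes: tuple = ("replication", "background_writer")
    running: set = field(default_factory=set)
    shared_memory: dict = field(default_factory=dict)

    def start(self) -> None:
        """Initialize shared memory, then launch all required processes."""
        self.shared_memory = {"replication_state": "not_connected"}
        self.running = set(self.required_processes)

    def reset(self) -> None:
        """Stop everything, re-initialize shared memory, and restart."""
        self.running.clear()   # stop all running processes
        self.start()           # fresh shared memory, processes restarted


server = SegmentServer()
server.start()
server.shared_memory["replication_state"] = "in_sync"
server.reset()
# The replication state recorded before the reset is gone.
state_after_reset = server.shared_memory["replication_state"]
```

In this model the required processes are running again after the reset, but the replication state has reverted to its initial value, which is why the replication framework may need to be re-established.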
When a reset occurs, communication between a primary and its mirror is interrupted, necessitating system reconfiguration. For example, if a primary resets, it will break and reinitialize communication with its mirror, and may fail to respond to a health check from FTS, causing the mirror or FTS to assume that the primary is faulty. As a result, the primary will be marked as offline, and the system will no longer be fault tolerant. In current distributed databases, both the primary and the mirror can initiate a reset if either detects an event that requires a reset.
Since there are three different remote nodes (master, primary and mirror) that can interact, the timing and duration of events can create problems. For example, if a mirror detects a replication fault before it receives a reset request from the primary, it may erroneously report to the FTS on the master node that the primary is faulty, causing the FTS to designate the primary as faulty. Additionally, if a reset stops a running process on a segment before it completes responding to an FTS probe, FTS may assume the segment is faulty and transition it to a different state, causing disruption of replication and possibly system reconfiguration. Furthermore, the mirror will be promoted to primary and caused to enter a low-availability mode. Any currently executing operation (query) will be either suspended or cancelled since its execution requires coordination with the new primary. The system will remain offline and unavailable until either the replication mechanism between the primary and mirror is re-established or the mirror is transitioned to primary in low-availability mode.
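The probe timing problem described above can be illustrated with a simplified model that uses logical time values. The function fts_verdict and its parameters are hypothetical, and the model assumes that FTS treats any probe whose response is interrupted by a reset as a failure, without retrying.

```python
def fts_verdict(probe_sent: float, response_time: float,
                reset_start: float, reset_end: float) -> str:
    """Classify a segment from one probe in a toy timing model.

    The segment needs `response_time` units to answer a probe sent at
    `probe_sent`. A reset occupies the window [reset_start, reset_end).
    Assumption: a response that overlaps the reset window is lost.
    """
    response_done = probe_sent + response_time
    interrupted = probe_sent < reset_end and reset_start < response_done
    return "faulty" if interrupted else "healthy"


# The segment is sound in both cases; only the probe timing differs.
during_reset = fts_verdict(probe_sent=0.0, response_time=1.0,
                           reset_start=0.5, reset_end=3.0)
after_reset = fts_verdict(probe_sent=5.0, response_time=1.0,
                          reset_start=0.5, reset_end=3.0)
```

Under these assumptions, the probe that overlaps the reset yields a verdict of faulty, while the probe sent after the reset completes yields healthy, even though the segment itself is the same in both cases.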
Established reset mechanisms have significant undesirable consequences and are accompanied by a number of other problems, some of which have been mentioned. Reestablishing communications and reconfiguring the system are heavyweight, time-consuming processes. What are needed are reset mechanisms that execute autonomously between a primary and mirror without external coordination, are transparent to FTS so as to simplify the fault detection logic, and minimize downtime and disruption of user experience.
It is desirable to provide new and improved reset systems and methods that address the foregoing and other problems of known reset approaches, and it is to these ends that the present invention is directed.