In some computer systems, it is important to maximize the availability of critical data and applications. Generally, this is achieved by using a fault tolerant system or by using high availability (“HA”) software, which is implemented on a cluster of multiple nodes.
A fault tolerant computer system includes duplicate hardware and software. For example, a fault tolerant server may have redundant power supplies, storage devices, fans, network interface cards, and so on. When one or more of these components fails, the fault is detected, and a redundant component takes over to continue servicing data and application requests. However, these systems typically are tightly coupled to the operating system and may rely on some operating system intervention to support full recovery.
HA software also provides fault detection and correction procedures. In contrast to fault tolerant systems, HA software is implemented on two or more nodes, which are arranged in a “cluster” and communicate over a link (e.g., a network). Typically, one node operates as the “master” for a particular application, where the master is responsible for executing the application. One or more other nodes within the cluster are “slaves” for that application, where each slave is available to take over the application from a failed master, if necessary.
Generally, one disadvantage to an HA system is that failure recovery typically takes much longer than it would with a fault tolerant system. Therefore, significant system downtimes may be perceived by system users. One reason for the relatively slow failure recovery times is the way that failures are detected and responded to.
In some systems, each slave periodically “pings” other nodes to determine whether they are reachable. If a master node fails to respond before a certain timeout period expires, the slave treats the master as unreachable, declares a failure, and attempts to take over as master. Because this process relies on timeout periods and network communications, it provides slower recovery than is possible using fault tolerant systems. Besides being somewhat slower to recover, these systems have the further disadvantage that they cannot detect the failure of a single application within a master node. Instead, the entire node must fail in order for a failure to be detected.
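By way of illustration only, the timeout-based detection described above might be sketched as follows. The class name, method names, and the choice to treat a node that has never replied as unreachable are assumptions made for this sketch and are not taken from any particular HA product. Timestamps are passed in explicitly so that the logic can be followed without a real network or clock.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TimeoutFailureDetector:
    """Hypothetical slave-side record of when each peer last answered a ping."""

    timeout_s: float
    last_reply: Dict[str, float] = field(default_factory=dict)

    def record_reply(self, node: str, now: float) -> None:
        # Called whenever a ping to `node` is answered.
        self.last_reply[node] = now

    def is_unreachable(self, node: str, now: float) -> bool:
        # Assumption: a node that has never replied counts as unreachable.
        last = self.last_reply.get(node)
        if last is None:
            return True
        # The node is declared failed only after the full timeout has elapsed.
        return (now - last) > self.timeout_s


detector = TimeoutFailureDetector(timeout_s=3.0)
detector.record_reply("master", now=10.0)
still_ok = detector.is_unreachable("master", now=12.0)      # within timeout
declared_failed = detector.is_unreachable("master", now=13.5)  # timeout passed
```

In this sketch, a master that fails immediately after replying at time 10.0 is not declared failed until some check made after time 13.0, which illustrates why recovery time is bounded below by the timeout period. The detector also observes only whether the node answers, so a single failed application on an otherwise responsive node goes unnoticed.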
The “Time Synchronization Protocol” (TSP) is an example of such an HA protocol, which is used by the clock synchronization programs timed and TEMPO. TSP supports messages for the election that occurs among slaves when, for any reason, the master disappears. Basically, the election process chooses a new master from among the available slaves when the original master ceases to send out heartbeat messages. All of these processes consume precious time and may extend the system recovery time.
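For illustration, a heartbeat-triggered election of the general kind described above might be sketched as follows. This is not the TSP message exchange itself. The function names and the rule that the lowest node identifier wins are assumptions chosen only to keep the sketch short and deterministic.

```python
from typing import Iterable


def needs_election(last_heartbeat: float, now: float, timeout_s: float) -> bool:
    """Return True once the master's heartbeat has been silent past the timeout."""
    return (now - last_heartbeat) > timeout_s


def elect_master(available_slaves: Iterable[str]) -> str:
    """Choose a new master from the available slaves.

    Assumption: the slave with the lowest identifier wins. A real protocol
    would exchange candidacy and acceptance messages among the slaves.
    """
    candidates = sorted(available_slaves)
    if not candidates:
        raise ValueError("no slave is available to take over as master")
    return candidates[0]


# The last heartbeat arrived at t=100.0; slaves wait 2.0 seconds before electing.
if needs_election(last_heartbeat=100.0, now=103.0, timeout_s=2.0):
    new_master = elect_master(["node-c", "node-b"])
```

Even in this simplified form, the slaves must first wait out the heartbeat timeout and then agree on a winner, and both steps add to the recovery time.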
However, synchronizing critical controller data can also be time consuming, and the added overhead imposed on the controller slows overall performance. The industry is searching for a solution that can minimize the time it takes to synchronize the critical controller data without adding undue cost.
Thus, a need still remains for a storage controller system with data synchronization. In view of enterprise system requirements for uninterrupted, reliable operation and increasing performance, and in view of the ever-increasing commercial competitive pressures, growing consumer expectations, and diminishing opportunities for meaningful product differentiation in the marketplace, it is increasingly critical that answers be found to these problems. Additionally, the need to reduce costs, improve efficiencies and performance, and meet competitive pressures adds an even greater urgency to the critical necessity for finding answers to these problems.
Solutions to these problems have been long sought, but prior developments have not taught or suggested any solutions; thus, solutions to these problems have long eluded those skilled in the art.