The present invention relates to backup hardware in electronic computer systems, and, in particular, to standby single board computers (SBC's). Even more particularly, the present invention relates to a standby single board computer backplane system and method.
During the past decade, the personal computer industry has literally exploded into the culture and business of many industrialized nations. Personal computers, while first designed for applications of limited scope involving individuals sitting at terminals, producing work products such as documents, databases, and spreadsheets, have matured into highly sophisticated and complicated tools. What was once a business machine reserved for home and office applications has now found numerous deployments in complicated industrial control systems, communications, data gathering, and other industrial and scientific venues. As the power of personal computers has increased by orders of magnitude since the introduction of the personal computer, personal computers have been found performing tasks once reserved to mini-computers, mainframes and even supercomputers.
In many of these applications, personal computers perform mission critical tasks involving significant stakes and low tolerance for failure. In these environments, even a single short-lived failure of a personal computer can represent a significant financial event for its owner.
Industrial personal computers are used in critical applications that require much higher levels of reliability than provided by most personal computers. They are used for telephony applications, such as controlling a company's voice mail or e-mail systems. They may be used to control critical machines, such as check sorting, or mail sorting for the U.S. Postal Service. Computer failures in these applications can result in significant loss of revenue or loss of critical information. For this reason, companies seek to purchase industrial personal computers, specifically looking for features that increase reliability, such as better cooling, redundant, hot-swappable power supplies or redundant disk arrays. These features have provided relief for some failures, but these systems are still vulnerable to failures of the single board computer (SBC) within the industrial personal computer system itself. If the processor, memory or support circuitry on a single board computer fails, or software fails, the single board computer can hang or behave in such a way that the entire industrial personal computer system fails. Some industry standards heretofore dictated that the solution to this problem is to maintain two completely separate industrial personal computer systems, including redundant single board computers and interface cards. In many cases, these interface cards are very expensive, perhaps as much as ten times the cost of the single board computer.
As a result, various mechanisms for creating redundancy within and between personal computers have been attempted in an effort to provide backup hardware that can take over in the event of a failure.
One approach, mentioned above, to providing backup hardware, referred to herein as complete redundancy, involves maintaining a duplicate (or backup) personal computer and duplicate attendant interface devices, storage devices, chassis and power supplies on hand to either manually or automatically switch control in the event that a primary personal computer fails in one way or another. Unfortunately, this level of redundancy requires that all components of the primary personal computer be duplicated in the backup personal computer. While this provides arguably a maximum degree of redundancy and thus security, it requires that in many instances very expensive or non-critical hardware be duplicated.
For example, in many industrial applications, highly specialized interface boards are used to interface systems with the personal computer. These systems may involve telephony, such as cellular telephony, voice mail, data acquisition, monitoring, control, and other such applications. In the event that one of these interface boards were to fail, generally, the remaining operations performed by the personal computer can continue to perform. For example, in the case of a cellular telephone system, the loss of a single interface board may mean that one "line" is out of service, but remaining "lines" remain in service. This level of failure is hardly noticeable by customers of the cellular telephony system, and thus is generally considered tolerable. On the other hand, however, these interface boards are extremely expensive and highly specialized. Thus, maintaining redundancy of these boards is both undesirable and unnecessary.
Unfortunately, prior approaches, including complete redundancy, fail to address this real world fact adequately.
For example, in U.S. Pat. No. 5,185,693, Loftis, et al., teach a backup mode of operation in which a primary personal computer can be replaced by a backup personal computer in the event a failure is detected. Failure is detected through a local area network that couples the primary personal computer to the secondary personal computer. The primary and secondary personal computers are coupled through a complicated bus switch that routes either a bus from the primary personal computer or a bus from the secondary personal computer to a plurality of remotely located (field) input/output units. The input/output units are further coupled to process instrumentation for monitoring and/or controlling an ongoing process, such as a manufacturing process.
In operation, the backup personal computer monitors the status of the primary personal computer through the local area network. Through the local area network, active data in the secondary personal computer is constantly updated with current information concerning process monitoring and control. This local area network connection may further be used to monitor the status of the primary personal computer using the secondary personal computer by, for example, deploying a watchdog timer to detect loss of bus activity. Alternatively, a separate digital output device, coupled to a terminal end of the input/output bus, may use a watchdog timer to monitor the bus for a lack of bus activity and to effect the switch over from the primary personal computer to the secondary personal computer in the event of such loss for more than a timeout period. In either case, in the event a loss of bus activity is detected, a switch switches from the primary personal computer to the secondary personal computer to gain control over the data bus leading to the remotely located input/output units.
Unfortunately, the switch employed in the illustrated device is highly complicated, and thus is itself sensitive to failures. In the event the switch does fail, switch over from the primary personal computer to the secondary personal computer cannot occur. Monitoring of the primary personal computer for failures is disadvantageously hindered by the fact that the secondary personal computer, in one embodiment, monitors the primary personal computer, and even then, monitoring is primitive, i.e., only bus activity is monitored. Because of this, in the event that the secondary personal computer fails, the primary personal computer will no longer be monitored, and thus the switch over to the secondary personal computer will not occur. And, because no monitoring of the secondary personal computer is performed, this failure of the secondary personal computer will not be detected, meaning that the primary personal computer can go unmonitored and without backup for a significant period of time without detection. Similarly, in an alternative embodiment, the data output on the remote bus is used to monitor for bus activity, and to effect switch over between the primary computer and the secondary computer in the event of a lack of bus activity. Unfortunately, bus activity can be generated by devices other than the primary and secondary personal computers, and thus may not be a good indicator of failure. And, with modern personal computers, a failure in one process on the primary personal computer may not result in a complete failure of the personal computer. Thus, a process can remain locked up while bus activity continues (as a result of activities of other processes on the primary personal computer or remote input/output units), and thus the failure goes undetected. As a result, bus activity may continue despite a catastrophic failure of the primary personal computer.
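The weakness of bus-activity monitoring described above can be illustrated with a minimal sketch. It is a simplified model and not the circuit of Loftis, et al.; the function name `bus_activity_watchdog`, the tick-based timing, and the source labels `"primary"` and `"io"` are assumptions introduced only for illustration.

```python
def bus_activity_watchdog(events, timeout):
    """Return the tick at which a switch over would trigger, or None.

    `events` is a list with one entry per tick; each entry is the set of
    (hypothetical) source names that drove the bus during that tick.
    The watchdog only sees whether the bus was active, not who drove it.
    """
    idle = 0
    for tick, sources in enumerate(events):
        if sources:
            idle = 0          # any activity at all resets the idle count
        else:
            idle += 1
            if idle >= timeout:
                return tick   # bus quiet for the whole timeout period
    return None


# The primary stops driving the bus after tick 1, but an input/output unit
# keeps generating traffic, so the failure is never detected.
masked_failure = [{"primary"}, {"primary", "io"}, {"io"}, {"io"}, {"io"}, {"io"}]

# The same failure on an otherwise quiet bus is detected once the idle
# count reaches the timeout.
quiet_failure = [{"primary"}, set(), set(), set()]
```

In this model the masked failure never produces a switch over, while the quiet-bus failure does, which is the distinction the preceding paragraph draws.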
Furthermore, the approach offered by Loftis, et al., fails to address the principal issue outlined above. Specifically, it does not provide a backup of the primary personal computer using the secondary personal computer while at the same time utilizing a common set of interface cards. Unlike the input/output units shown by Loftis, et al., interface cards are internal to the system of the personal computer, generally housed within a single housing therewith. The external approach offered by Loftis, et al., thus would not offer a solution to the needs of modern industrial computer users.
Other examples of backup systems are shown in U.S. Pat. No. 5,434,998 (Akai, et al.), U.S. Pat. No. 5,583,987 (Kobayashi, et al.), and U.S. Pat. No. 5,729,675 (Miller, et al.).
The present invention addresses the above and other needs.
The present invention advantageously addresses the needs above as well as other needs by providing a standby computer backplane system and method.
In one embodiment, the invention can be characterized as a computer system employing a first computer; a first bus switch coupled to the first computer; a data bus coupled to the first computer via the first bus switch; a second computer; a second bus switch coupled to the second computer, the data bus being coupled to the second computer through the second bus switch; and a monitor system coupled to the first computer, to the first bus switch, and to the second bus switch. The monitor system employs a watchdog timer coupled to a switch over circuit, wherein a watchdog timeout period exceeds a period between executions of a reset code, the reset code being included in software executing on the first computer, wherein a reset signal is generated in response to execution of the reset code, thereby resetting the watchdog timer prior to the watchdog timeout period, and wherein upon a failure in the first computer the reset code is not executed, and therefore the reset signal is not generated, thereby not resetting the watchdog timer prior to the watchdog timeout period, wherein the watchdog timer generates a switch over signal in the event the watchdog timeout period is reached before the watchdog timer is reset, wherein the switch over circuit opens the first data bus switch and closes the second data bus switch in response to the switch over signal.
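The watchdog and switch over behavior of this embodiment can be sketched in software as follows. This is a simplified simulation under stated assumptions, not the hardware of the invention: the class names `BusSwitch` and `Monitor`, the function `run`, and the use of discrete ticks in place of real time are all hypothetical and chosen only for illustration.

```python
class BusSwitch:
    """Hypothetical model of a bus switch; closed means connected to the data bus."""

    def __init__(self, closed):
        self.closed = closed


class Monitor:
    """Hypothetical watchdog timer plus switch over circuit."""

    def __init__(self, timeout_ticks, first_switch, second_switch):
        self.timeout_ticks = timeout_ticks
        self.elapsed = 0
        self.first_switch = first_switch
        self.second_switch = second_switch
        self.switched_over = False

    def reset(self):
        # Reset signal, generated when the reset code executes on the first computer.
        self.elapsed = 0

    def tick(self):
        if self.switched_over:
            return
        self.elapsed += 1
        if self.elapsed >= self.timeout_ticks:
            # Switch over signal: open the first switch, close the second.
            self.first_switch.closed = False
            self.second_switch.closed = True
            self.switched_over = True


def run(total_ticks, reset_period, fail_at, timeout_ticks):
    """Simulate the first computer; it stops executing the reset code at `fail_at`."""
    first = BusSwitch(closed=True)
    second = BusSwitch(closed=False)
    monitor = Monitor(timeout_ticks, first, second)
    for t in range(1, total_ticks + 1):
        healthy = fail_at is None or t < fail_at
        if healthy and t % reset_period == 0:
            monitor.reset()
        monitor.tick()
    return first.closed, second.closed
```

With a timeout of 5 ticks and a reset every 2 ticks, a healthy first computer keeps the first switch closed; if the reset code stops executing, the timeout is reached and the data bus is handed to the second computer. The timeout must exceed the reset period, as the embodiment states, or a healthy first computer would be switched out.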