The present invention relates to backup hardware in electronic computer systems, and, in particular, to standby single board computers (SBCs). Even more particularly, the present invention relates to a standby single board computer backplane system and method.
During the past decade, the personal computer industry has literally exploded into the culture and business of many industrialized nations. Personal computers, while first designed for applications of limited scope involving individuals sitting at terminals, producing work products such as documents, databases, and spreadsheets, have matured into highly sophisticated and complicated tools. What was once a business machine reserved for home and office applications has now found numerous deployments in complicated industrial control systems, communications, data gathering, and other industrial and scientific venues. As the power of personal computers has increased by orders of magnitude since the introduction of the personal computer, personal computers have been found performing tasks once reserved to mini-computers, mainframes and even supercomputers.
In many of these applications, personal computers perform mission critical tasks involving significant stakes and low tolerance for failure. In these environments, even a single short-lived failure of a personal computer can represent a significant financial event for its owner.
Industrial personal computers are used in critical applications that require much higher levels of reliability than provided by most personal computers. They are used for telephony applications, such as controlling a company's voice mail or e-mail systems. They may be used to control critical machines, such as check sorting machines, or mail sorting machines for the U.S. Postal Service. Computer failures in these applications can result in significant loss of revenue or loss of critical information. For this reason, companies seek to purchase industrial personal computers, specifically looking for features that increase reliability, such as better cooling, redundant, hot-swappable power supplies or redundant disk arrays. These features have provided relief for some failures, but these systems are still vulnerable to failures of the single board computer (SBC) within the industrial personal computer system itself. If the processor, memory or support circuitry on a single board computer fails, or software fails, the single board computer can be caused to hang up or behave in such a way that the entire industrial personal computer system fails. Some industry standards heretofore dictated that the solution to this problem is to maintain two completely separate industrial personal computer systems, including redundant single board computers and interface cards. In many cases, these interface cards are very expensive, perhaps as much as ten times the cost of the single board computer.
As a result, various mechanisms for creating redundancy within and between personal computers have been attempted in an effort to provide backup hardware that can take over in the event of a failure.
One approach, mentioned above, to providing backup hardware, referred to herein as complete redundancy, involves maintaining a duplicate (or backup) personal computer and duplicate attendant interface devices, storage devices, chassis and power supplies on hand to either manually or automatically switch into control in the event that a primary personal computer fails in one way or another. Unfortunately, this level of redundancy requires that all components of the primary personal computer be duplicated in the backup personal computer. While this provides arguably a maximum degree of redundancy and thus security, it requires that in many instances very expensive or non-critical hardware be duplicated.
For example, in many industrial applications, highly specialized interface boards are used to interface systems with the personal computer. These systems may involve telephony, such as cellular telephony, voice mail, data acquisition, monitoring, control, and other such applications. In the event that one of these interface boards were to fail, the remaining operations performed by the personal computer can generally continue. For example, in the case of a cellular telephone system, the loss of a single interface board may mean that one "line" is out of service, but remaining "lines" remain in service. This level of failure is hardly noticeable by customers of the cellular telephony system, and thus is generally considered tolerable. On the other hand, however, these interface boards are extremely expensive and highly specialized. Thus, maintaining redundancy of these boards is both undesirable and unnecessary.
Unfortunately, prior approaches, including complete redundancy, fail to address this real world fact adequately.
For example, in U.S. Pat. No. 5,185,693, Loftis, et al., teach a backup mode of operation in which a primary personal computer can be replaced by a backup personal computer in the event a failure is detected. Failure is detected through a local area network that couples the primary personal computer to the secondary personal computer. The primary and secondary personal computers are coupled through a complicated bus switch that routes either a bus from the primary personal computer or a bus from the secondary personal computer to a plurality of remotely located (field) input/output units. The input/output units are further coupled to process instrumentation for monitoring and/or controlling an ongoing process, such as a manufacturing process.
In operation, the backup personal computer monitors the status of the primary personal computer through the local area network. Through the local area network, active data in the secondary personal computer is constantly updated with current information concerning process monitoring and control. This local area network connection may further be used to monitor the status of the primary personal computer using the secondary personal computer by, for example, deploying a watchdog timer to detect loss of bus activity. Alternatively, a separate digital output device, coupled to a terminal end of the input/output bus, may use a watchdog timer to monitor the bus for a lack of bus activity and to effect the switch over from the primary personal computer to the secondary personal computer in the event of such loss for more than a timeout period. In either case, in the event a loss of bus activity is detected, a switch switches from the primary personal computer to the secondary personal computer to gain control over the data bus leading to the remotely located input/output units.
Unfortunately, the switch employed in the illustrated device is highly complicated, and thus is itself sensitive to failures. In the event the switch does fail, switch over from the primary personal computer to the secondary personal computer cannot occur. Monitoring of the primary personal computer for failures is disadvantageously hindered by the fact that the secondary personal computer, in one embodiment, monitors the primary personal computer, and even then, monitoring is primitive, i.e., only bus activity is monitored. Because of this, in the event that the secondary personal computer fails, the primary personal computer will no longer be monitored, and thus the switch over to the secondary personal computer will not occur. And, because no monitoring of the secondary personal computer is performed, this failure of the secondary personal computer will not be detected, meaning that the primary personal computer can go unmonitored and without backup for a significant period of time without detection. Similarly, in an alternative embodiment, the data output on the remote bus is used to monitor for bus activity, and to effect switch over between the primary computer and the secondary computer in the event of a lack of bus activity. Unfortunately, bus activity can be generated by devices other than the primary and secondary personal computers, and thus may not be a good indicator of failure. And, with modern personal computers, a failure in one process on the primary personal computer may not result in a complete failure of the personal computer. Thus, a process can remain locked up while bus activity continues (as a result of activities of other processes on the primary personal computer or remote input/output units), and thus the failure goes undetected. As a result, bus activity may continue despite a catastrophic failure of the primary personal computer.
Furthermore, the approach offered by Loftis, et al., fails to address the principal issue outlined above. Specifically, it does not provide a backup of the primary personal computer using the secondary personal computer while at the same time utilizing a common set of interface cards. Unlike the input/output units shown by Loftis, et al., interface cards are internal to the system of the personal computer, generally housed within a single housing therewith. The external approach offered by Loftis, et al., thus would not offer a solution to the needs of modern industrial computer users.
Other examples of backup systems are shown in U.S. Pat. No. 5,434,998 (Akai, et al.), U.S. Pat. No. 5,583,987 (Kobayashi, et al.), and U.S. Pat. No. 5,729,675 (Miller, et al.).
The present invention addresses the above and other needs.
The present invention advantageously addresses the needs above as well as other needs by providing a standby computer backplane system and method.
In one embodiment, the invention can be characterized as a computer system comprising a first computer coupled to a primary PCI bus via a first PCI bus switch and a second computer coupled to the primary PCI bus via a second PCI bus switch. A monitor system is coupled to both the first and second computers as well as the first and second PCI bus switches. In the event of a malfunction in the first computer, the monitor system decouples the first computer from the primary PCI bus by opening the first PCI bus switch, and couples the second computer to the primary PCI bus by closing the second PCI bus switch.
In another embodiment, the present invention can be characterized as a computer system comprising a computer coupled to a primary PCI bus via a PCI bus switch. A monitor system is coupled to both the computer and the PCI bus switch. In the event of a malfunction in the computer, the monitor system decouples the computer from the primary PCI bus by opening the PCI bus switch and produces a signal indicating that a malfunction has occurred. In a preferred embodiment, the signal may be an illuminated light. The illuminated light may be located on a housing of the computer system.
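The single-computer embodiment above can be illustrated with a minimal sketch. It is not the claimed hardware: the class name `SingleComputerMonitor`, its fields, and the `report` method are hypothetical, and the way a malfunction is detected is left outside the sketch and passed in as a simple flag.

```python
from dataclasses import dataclass


@dataclass
class SingleComputerMonitor:
    """Hypothetical model of a monitor system guarding one computer."""

    switch_closed: bool = True    # True: computer is coupled to the primary PCI bus
    fault_light_on: bool = False  # indicator light, e.g. on the system housing

    def report(self, malfunction: bool) -> None:
        """Handle a status report; detection itself is assumed to happen elsewhere."""
        if malfunction:
            self.switch_closed = False   # open the PCI bus switch (decouple)
            self.fault_light_on = True   # produce the malfunction signal


monitor = SingleComputerMonitor()
monitor.report(malfunction=False)   # healthy: computer stays coupled, light off
healthy_state = (monitor.switch_closed, monitor.fault_light_on)
monitor.report(malfunction=True)    # fault: computer decoupled, light on
fault_state = (monitor.switch_closed, monitor.fault_light_on)
```

In this sketch the switch is not closed again once opened, on the assumption that recoupling the computer would be a separate, deliberate step.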
In yet another embodiment, the present invention can be characterized as a method of monitoring a computer system comprising coupling a first computer to a primary PCI bus via a first PCI bus switch and coupling a second computer to the primary PCI bus via a second PCI bus switch. The method further comprises coupling the first and second computers and the first and second PCI bus switches to a monitor system. The method additionally comprises producing a signal in the first computer at a regular interval and resetting a watchdog timer in the monitor system in response to the signal. The method further comprises, in the event the watchdog timer is not reset, decoupling the first computer from the primary PCI bus by opening the first PCI bus switch and coupling the second computer to the primary PCI bus by closing the second PCI bus switch.
In another embodiment, the invention can be characterized as a system comprising a first computer coupled to a primary PCI bus via a first PCI bus switch and a second computer coupled to the primary PCI bus via a second PCI bus switch. A monitoring system is coupled to the first and second computers and the first and second PCI bus switches. Within the monitoring system is a watchdog timer which is periodically reset in response to signals from the first computer. A switch over circuit is coupled to the watchdog timer such that, in the event a malfunction occurs in the first computer, the signals are not sent to the watchdog timer, the watchdog timer is therefore not reset, and a watchdog timeout period is exceeded. This arms the switch over circuit so that the monitoring system decouples the first computer from the primary PCI bus by opening the first PCI bus switch, and couples the second computer to the primary PCI bus by closing the second PCI bus switch.
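The watchdog-driven switch over described in the method and system embodiments can be sketched in software as follows. This is only an illustrative simulation under stated assumptions, not the hardware of the invention: the names `BusSwitch`, `Monitor`, `heartbeat`, `poll`, and `simulate`, the one-second timeout, and the use of explicit time stamps in place of a real clock are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BusSwitch:
    """Hypothetical PCI bus switch model: closed means coupled to the bus."""

    closed: bool = False


@dataclass
class Monitor:
    """Hypothetical monitoring system with a watchdog timer and switch over logic."""

    first_switch: BusSwitch
    second_switch: BusSwitch
    timeout: float            # watchdog timeout period in seconds (assumed value)
    last_reset: float = 0.0   # time of the most recent watchdog reset
    switched_over: bool = False

    def heartbeat(self, now: float) -> None:
        """Periodic signal from the first computer: reset the watchdog timer."""
        self.last_reset = now

    def poll(self, now: float) -> None:
        """Check the watchdog; if the timeout is exceeded, perform the switch over."""
        if self.switched_over:
            return
        if now - self.last_reset > self.timeout:
            self.first_switch.closed = False   # decouple the first computer
            self.second_switch.closed = True   # couple the second computer
            self.switched_over = True


def simulate(heartbeat_times, poll_times, timeout=1.0):
    """Replay heartbeat and poll events in time order; return final switch states."""
    first = BusSwitch(closed=True)     # first computer starts coupled to the bus
    second = BusSwitch(closed=False)   # second computer starts decoupled
    monitor = Monitor(first, second, timeout)
    events = sorted(
        [(t, "heartbeat") for t in heartbeat_times]
        + [(t, "poll") for t in poll_times]
    )
    for t, kind in events:
        if kind == "heartbeat":
            monitor.heartbeat(t)
        else:
            monitor.poll(t)
    return first.closed, second.closed


# Heartbeats keep arriving: the first computer remains coupled.
healthy = simulate([0.5, 1.0, 1.5, 2.0], [0.6, 1.1, 1.6, 2.1])
# Heartbeats stop after t = 1.0: the watchdog expires and the second computer takes over.
failed = simulate([0.5, 1.0], [0.6, 1.1, 1.6, 2.1, 2.6])
```

Unlike monitoring for bus activity, this scheme depends on a signal that only the first computer produces, so traffic generated by other devices on the bus would not reset the watchdog in this sketch.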