Not applicable.
Not applicable.
1. Field of the Invention
The present invention relates generally to fault tolerance in microcomputer systems, and in particular to computer systems adapted to periodically check for failures. More particularly, the present invention relates to a personal computer system capable of transmitting and receiving heartbeat messages at an adjustable rate for improved fault tolerance.
2. Background of the Invention
Although early microcomputers were popular with hobbyists for computing tasks such as word processing and video games, early microcomputer systems did not match the superior data processing speed of larger mainframes and minicomputers. Consequently, most businesses and organizations that required a high level of data processing and communications, including financial, academic, and scientific institutions, traditionally relied on networks of mainframes and minicomputers for computing tasks. In recent years, microcomputers, which may be generally defined as microprocessor-based, programmable electronic devices for retrieving, storing, and processing data, have developed rapidly in terms of processor speed, memory speed and capacity, and interconnectability. As microcomputing capabilities approach those of mainframes and minicomputers, networks of personal computer systems increasingly are utilized for the heavy data processing and communications jobs once handled by the larger machines.
Because of the sheer amount of data that must be processed by some organizations (e.g., financial and research institutions) and also the sensitivity of some data to computer system faults (such as air traffic control data and banking transactions), mainframe computers usually have incorporated measures to ensure fault tolerance, or the capability of a computer system or network of computers to continue operating even if an internal hardware or software failure occurs. Hence, fault tolerant systems are designed to operate essentially without interruptions. One method of providing fault tolerance is to combine a primary computer system with a backup system. A backup system generally waits in a standby mode without processing data until the primary system fails. When the primary system fails, the backup system replaces the primary system. The calculations of the primary system can thus be continued by the backup system, albeit with a slight interruption before the backup system is activated. Another fault tolerance scheme involves combining two “redundant” computer systems which process the same data concurrently. If one of the systems fails, then the data may still be processed by the working system. A major drawback to redundant systems is their significant expense, due to the fact that two or more data processing systems are required instead of just one. In one type of hybrid system, two or more computers operate independently, processing different data but attached to a common network. When a computer fails, the failed machine is disabled and the remaining computers on the network assume the workload of the failed computer.
Because the cost of a typical microcomputer (or “personal computer”) has remained well below the cost of a typical mainframe even as personal computing capabilities have soared, it has become increasingly cost effective to use personal computer (PC) systems for tasks that were once reserved only for mainframes. In addition, PC manufacturers have encouraged using personal computers for these tasks by introducing fault tolerance mechanisms into some recent computer designs. Fault tolerant PC networks have been introduced, as well. Personal computer networks generally include one or more personal computers configured as network servers which manage the network and the transfer and storage of data within the network. Network servers generally comprise an abundance of resources, including one or more very fast processors, a large amount of random access memory (RAM), and an abundance of disk storage space. Further, network servers typically operate at fast input/output (I/O) speeds and are given more frequent access to the network than are other computers on the network. The abundance of resources and increased network access allow each network server to transfer files and data efficiently to a large number of networked computers. Because a single failure in a network server may cause network problems or even downtime to many computer users, fault tolerant network servers generally have benefited network performance and have helped to minimize network downtime.
In one network fault tolerance scheme, two servers operate independently of each other but are capable of handling an increased workload if one of the servers fails. In such a scheme, a first server periodically transmits a “heartbeat” message over the network to a second server to indicate that the first server is functioning properly. If the second server does not receive the heartbeat message within a predetermined time interval, then the second server concludes that the first server has failed and seizes the workload of the first server. The second server also transmits a periodic heartbeat message to the first server, so that the first server may process data in place of the second server if the second server fails. Thus, each server essentially provides backup support for the other server in case of a server failure. The heartbeats typically are transmitted infrequently in order to minimize the level of network traffic.
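By way of illustration only, the timeout test that each server applies to its peer in such a conventional scheme may be sketched as follows. The class name PeerWatch, its fields, and the 5-second timeout are hypothetical and are not taken from any particular prior-art system.

```python
from dataclasses import dataclass


@dataclass
class PeerWatch:
    """Tracks heartbeats received from one peer server (illustrative only)."""
    timeout_s: float          # assumed "predetermined time interval"
    last_seen_s: float = 0.0  # arrival time of the most recent heartbeat
    took_over: bool = False   # True once this server has assumed the peer's work

    def on_heartbeat(self, now_s: float) -> None:
        # The peer reported itself healthy at time now_s.
        self.last_seen_s = now_s

    def check(self, now_s: float) -> bool:
        # Conclude that the peer has failed if it has been silent for longer
        # than the timeout, and remember that its workload was taken over.
        if now_s - self.last_seen_s > self.timeout_s:
            self.took_over = True
        return self.took_over


watch = PeerWatch(timeout_s=5.0)
watch.on_heartbeat(now_s=10.0)
print(watch.check(now_s=12.0))  # heartbeat is still recent
print(watch.check(now_s=16.0))  # more than 5 s of silence
```

In a two-server arrangement each server would hold one such object for the other, so that either may take over for its peer.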
One problem with the heartbeat scheme is that because the heartbeat messages are transmitted at fixed time intervals (or “heartbeat periods”), the heartbeat scheme may be unsuitable for networks which cannot permit downtime greater than one heartbeat period. For instance, if one server fails immediately after transmitting a heartbeat, then it will take almost one full heartbeat period before the second server detects and corrects for the failure. In some sensitive networks, such excessive downtime conceivably could severely degrade network service, cause network instability, or even result in human catastrophe if the network is involved in transportation or safety systems. Conversely, systems needing only a moderate level of fault tolerance might not require a frequent heartbeat. Because all messages sent over a network require some amount of network capacity (or “bandwidth”), a network server transmitting heartbeats at a high rate may absorb large amounts of network bandwidth. Thus, the optimum heart rate may vary according to the type of information being processed and the processing speed. Because it is difficult to design a one-size-fits-all heartbeat scheme, such methods often are not well-suited for a wide range of user applications.
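The tradeoff described above can be made concrete with a short calculation. The following sketch assumes a hypothetical 64-byte heartbeat message and treats the worst-case detection delay as one full heartbeat period; both figures are illustrative assumptions rather than measured values.

```python
def heartbeat_tradeoff(period_s: float, message_bytes: int) -> tuple[float, float]:
    """Return (worst-case detection delay in seconds, bandwidth in bytes/second)."""
    # A failure immediately after a heartbeat goes unnoticed for about one period.
    worst_case_delay_s = period_s
    # One message of message_bytes is sent every period_s seconds.
    bandwidth_bytes_per_s = message_bytes / period_s
    return worst_case_delay_s, bandwidth_bytes_per_s


for period in (10.0, 1.0, 0.1):
    delay, bandwidth = heartbeat_tradeoff(period, message_bytes=64)
    print(f"period={period} s  delay<={delay} s  bandwidth={bandwidth} B/s")
```

Shortening the period by a factor of ten reduces the worst-case delay by the same factor but multiplies the bandwidth consumed by ten, which is why no single fixed period suits every application.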
While conventional heartbeat schemes are capable of monitoring whether or not a computer system has failed, these methods usually do not help to predict when failures might occur. If computer failures could be predicted before happening, then corrective actions could be taken as soon as possible to prevent or minimize system downtime. Current heartbeat schemes fail to incorporate prediction measures, however.
Thus, there remains a need for a flexible and responsive fault tolerance scheme capable of determining as well as predicting system performance. Such a scheme preferably would be able to intelligently optimize the heart rate to improve response time during a system failure. Despite the apparent advantages of such a system, to date no one has devised a computer system that offers these benefits.
Accordingly, the present invention discloses a computer system comprising two central processing units (CPUs), a bridge logic device coupled to the CPUs, and a network interface card (NIC) coupled to the bridge logic, each device transmitting variable-rate heartbeats to a heartbeat monitor. The computer system further includes a main memory device coupled to the bridge logic. In a preferred embodiment, the heartbeats transmitted by the bridge logic device indicate that the main memory is properly functioning. Similarly, the heartbeats transmitted by the NIC represent heartbeats transmitted by another computer system which is coupled to the NIC via a network such as a local area network (LAN). Each CPU transmits heartbeats to the heartbeat monitor to indicate that it is functioning properly.
The heartbeat monitor comprises a register file including an HB register for each heartbeat sender that records incoming heartbeats. In addition to receiving heartbeats, the heartbeat monitor is capable of determining initial heart rates for each component transmitting a heartbeat (or “heartbeat sender”) and is further capable of adaptively adjusting the heartbeat intervals thereafter. The register file also includes an INTERVAL register, an MFG register, an MTBF register, and an MSG register for each heartbeat sender. The INTERVAL register specifies the heartbeat interval for the associated sender. The MFG and MTBF registers store the manufacturing date and mean time between failure, respectively, of the associated sender. The MSG register is used for transmitting messages between the heartbeat monitor and the associated heartbeat sender.
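For illustration, the per-sender registers may be modeled in software as shown below. The field types, the units (hours, milliseconds), the sender name "cpu0", and the convention that reading the HB register clears it are all assumptions made for this sketch; the hardware register file is not limited to this form.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SenderRegisters:
    """Per-sender registers named in the text; field types are assumed."""
    mfg: date             # MFG: manufacturing date
    mtbf_hours: int       # MTBF: mean time between failure
    interval_ms: int = 0  # INTERVAL: current heartbeat interval
    hb: bool = False      # HB: set when a heartbeat arrives
    msg: str = ""         # MSG: message exchanged with the sender


class RegisterFile:
    def __init__(self) -> None:
        self.senders: dict[str, SenderRegisters] = {}

    def add_sender(self, name: str, mfg: date, mtbf_hours: int) -> None:
        self.senders[name] = SenderRegisters(mfg=mfg, mtbf_hours=mtbf_hours)

    def record_heartbeat(self, name: str, msg: str = "") -> None:
        # Called when a heartbeat (with an optional message) arrives.
        regs = self.senders[name]
        regs.hb = True
        regs.msg = msg

    def take_heartbeat(self, name: str) -> bool:
        # Assumed behavior: reading clears HB so the next check sees only new beats.
        regs = self.senders[name]
        seen, regs.hb = regs.hb, False
        return seen


rf = RegisterFile()
rf.add_sender("cpu0", mfg=date(2020, 1, 1), mtbf_hours=50_000)
rf.record_heartbeat("cpu0")
print(rf.take_heartbeat("cpu0"), rf.take_heartbeat("cpu0"))
```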
The heartbeat monitor further includes a control logic coupled to the register file and a plurality of adaptive interval controllers coupled to the control logic, each interval controller associated with a different heartbeat sender. The control logic further asserts interrupt signals to the CPUs, the bridge logic, and the NIC. A temperature sensor is also included within the heartbeat monitor and provides a temperature warning signal to the interval controllers. An adaptive interval controller determines an initial heartbeat interval for the associated heartbeat sender based on the age of the sender, which can be determined from the MFG and MTBF registers. If the age of the sender is younger than the MTBF, then a longer heartbeat interval is specified. Conversely, if the age of the sender is older than the MTBF, then a shorter heartbeat interval is specified. Once an appropriate initial heartbeat interval is determined, an adaptive interval generator transmits the interval to the register file and begins transmitting a periodic PULSE signal to the monitor control logic having a period equal to the heartbeat period. The monitor control logic then notifies the heartbeat sender of the initial heartbeat interval, and the heartbeat sender commences transmitting heartbeats at the appropriate intervals.
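The selection of the initial heartbeat interval may be sketched as a simple comparison. In the following illustration the age is taken as the time elapsed since the manufacturing date, and the two interval values (1000 ms and 100 ms) are hypothetical defaults chosen only for the example.

```python
from datetime import date


def initial_interval_ms(mfg: date, mtbf_hours: float, today: date,
                        long_ms: int = 1000, short_ms: int = 100) -> int:
    """Pick the initial heartbeat interval from the sender's age and MTBF."""
    # Assumed: age is the elapsed time since the MFG date, in hours.
    age_hours = (today - mfg).days * 24
    # A sender at or past its MTBF is checked more often (shorter interval).
    return short_ms if age_hours >= mtbf_hours else long_ms


# A part with an MTBF of one year (8760 hours), inspected at two ages.
print(initial_interval_ms(date(2020, 1, 1), 8760, today=date(2020, 6, 1)))
print(initial_interval_ms(date(2020, 1, 1), 8760, today=date(2022, 1, 1)))
```

The first call describes a sender younger than its MTBF and yields the longer interval; the second describes a sender older than its MTBF and yields the shorter one.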
The adaptive interval generator comprises an age counter for tracking the age of the sender, an MTBF register for holding the MTBF value, a comparator receiving the values of the age counter and MTBF register, and an interval determination logic receiving a COMPARE signal from the comparator. The adaptive interval generator further includes an error period counter and a timing interval counter, each coupled to the interval determination logic. The COMPARE signal is asserted if the value of the age counter is greater than or equal to the MTBF value, indicating that the sender is older than its MTBF. The interval determination logic thus determines a faster initial heart rate if the COMPARE signal is asserted. The age counter continuously increments, tracking the age of the sender. Thus, if the initial heartbeat interval is chosen for a sender that is younger than its MTBF, then the heart rate is increased when the value of the age counter exceeds the MTBF. The interval determination logic asserts RATE signals to the timing interval counter and to the control logic which indicate the heartbeat intervals. In response to the RATE signals, the timing interval counter measures each heartbeat interval, asserting a PULSE signal to the control logic after each successive interval. In response to each PULSE signal, the control logic checks the register file for a new heartbeat to determine whether the heartbeat sender is still functioning.
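A behavioral model of the adaptive interval generator, offered only as a sketch, is given below. It assumes that the age counter and the timing interval counter advance on a common clock tick and that the RATE value is simply the interval expressed in ticks; the class and parameter names are hypothetical.

```python
class AdaptiveIntervalGenerator:
    """Tick-based model of the age counter, comparator, and timing interval counter."""

    def __init__(self, age_ticks: int, mtbf_ticks: int,
                 long_interval: int, short_interval: int) -> None:
        self.age = age_ticks            # age counter
        self.mtbf = mtbf_ticks          # MTBF register
        self.long_interval = long_interval
        self.short_interval = short_interval
        self.timer = 0                  # timing interval counter

    @property
    def compare(self) -> bool:
        # COMPARE is asserted when the age is greater than or equal to the MTBF.
        return self.age >= self.mtbf

    @property
    def rate(self) -> int:
        # Interval determination: a shorter interval once COMPARE is asserted.
        return self.short_interval if self.compare else self.long_interval

    def tick(self) -> bool:
        """Advance one clock tick; return True when PULSE is asserted."""
        self.age += 1
        self.timer += 1
        if self.timer >= self.rate:
            self.timer = 0
            return True
        return False


gen = AdaptiveIntervalGenerator(age_ticks=0, mtbf_ticks=10,
                                long_interval=4, short_interval=2)
pulses = [t for t in range(1, 17) if gen.tick()]
print(pulses)  # [4, 8, 10, 12, 14, 16]
```

In this run PULSE is asserted every four ticks until the age counter reaches the MTBF at tick 10, after which it is asserted every two ticks.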
If the temperature sensor measures a temperature that exceeds a predetermined value, then the adaptive interval controllers respond by adjusting the heart rates of the associated senders. The increased heart rate (corresponding to a decreased heartbeat interval) allows the heartbeat monitor to check the heartbeat senders more frequently for failures.
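The temperature response may be illustrated as follows, assuming that the adjustment is an increase in heart rate as the preceding sentence suggests. The 70-degree limit and the halving of the interval are hypothetical values chosen for the example.

```python
def interval_for_temperature(base_interval_ms: int, temp_c: float,
                             limit_c: float = 70.0, divisor: int = 2) -> int:
    """Shorten the heartbeat interval while the temperature warning is active."""
    if temp_c > limit_c:
        # Temperature warning: raise the heart rate by shrinking the interval.
        return max(1, base_interval_ms // divisor)
    return base_interval_ms


print(interval_for_temperature(1000, temp_c=45.0))
print(interval_for_temperature(1000, temp_c=85.0))
```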
Along with a heartbeat message, a heartbeat sender may include warning or error messages indicating problems within the heartbeat sender. The adaptive interval generator associated with that sender responds to the warning message by temporarily decreasing the heartbeat interval to enable more frequent monitoring of the sender. Warning or error messages may cause a temporary increase in the heartbeat interval in some embodiments, however. After a predetermined error period, the heartbeat interval is then returned to normal unless the error condition persists. The error period is measured by the error period counter.
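The warning response may likewise be sketched in software. This illustration assumes that the error period is counted in heartbeats and that a repeated warning restarts the period; the class name and the interval values are hypothetical.

```python
class WarningResponder:
    """Temporarily shortens the interval after a warning (illustrative model)."""

    def __init__(self, normal_ms: int, warning_ms: int,
                 error_period_beats: int) -> None:
        self.normal_ms = normal_ms
        self.warning_ms = warning_ms
        self.error_period_beats = error_period_beats
        self.remaining = 0  # error period counter

    def on_heartbeat(self, warning: bool) -> int:
        """Process one heartbeat; return the interval to use for the next one."""
        if warning:
            # A new or persisting warning (re)starts the error period.
            self.remaining = self.error_period_beats
        elif self.remaining > 0:
            self.remaining -= 1
        return self.warning_ms if self.remaining > 0 else self.normal_ms


resp = WarningResponder(normal_ms=1000, warning_ms=250, error_period_beats=2)
beats = [False, True, False, False, False]
print([resp.on_heartbeat(w) for w in beats])  # [1000, 250, 250, 1000, 1000]
```

An embodiment that lengthens the interval on a warning could be modeled by supplying a warning_ms value larger than normal_ms.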
Thus, the present invention comprises a combination of features and advantages that enable it to substantially advance the art by providing an adaptive heartbeat monitor that dynamically changes the heart rates according to system demands. These and various other characteristics and advantages of the present invention will be readily apparent to those skilled in the art upon reading the following detailed description of the preferred embodiments of the invention and by referring to the accompanying drawings.