Referring to FIG. 1, a prior art computer system 100 is shown in which three processors 102a–c communicate with each other over a communications bus 108. The processors 102a–c may, for example, be multiple processors within a single multi-processor computer, or be processors in distinct computers communicating with each other over a network. As is well known to those of ordinary skill in the art, processors 102a–c may transmit and receive data and/or instructions over the communications bus 108 using messages encoded according to a communications protocol associated with the communications bus 108. One example of a conventional communications bus is the Inter-IC (I2C) bus.
The processors 102a–c are coupled to bus transceivers 106a–c over local buses 104a–c, respectively. To transmit a message over the communications bus 108 (also called a “system bus” to distinguish it from the local buses 104a–c), a processor queues the message at the corresponding bus transceiver. For example, for processor 102a to transmit a message over the communications bus 108 to processor 102b, processor 102a must transmit the message over local bus 104a to the corresponding bus transceiver 106a, where the message is queued. Bus transceiver 106a then negotiates for control of the communications bus 108 in accordance with the bus protocol to become a bus master, and transmits the message over the communications bus 108 to the destination bus transceiver 106b. Bus transceiver 106b forwards the message to the destination processor 102b, and the originating bus transceiver 106a indicates to the processor 102a (by transmitting an appropriate message over the local bus 104a) that the message has successfully been sent.
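By way of illustration only, the transmit path described above may be modeled in software roughly as follows. This is a minimal sketch rather than an implementation of any particular bus protocol: the class and method names (SystemBus, BusTransceiver, enqueue, service, and so on) are hypothetical, arbitration is reduced to a single "first requester wins" check, and the acknowledgment over the local bus is modeled as an entry appended to a list.

```python
from collections import deque


class SystemBus:
    """Shared communications bus; at most one transceiver is master at a time."""

    def __init__(self):
        self.master = None
        self.transceivers = {}

    def attach(self, transceiver):
        self.transceivers[transceiver.address] = transceiver

    def acquire(self, transceiver):
        # Simplified arbitration (an assumption): the bus is granted only if idle.
        if self.master is None:
            self.master = transceiver
            return True
        return False

    def release(self, transceiver):
        if self.master is transceiver:
            self.master = None


class Processor:
    """Hypothetical processor model holding received messages and send acks."""

    def __init__(self, name):
        self.name = name
        self.inbox = []
        self.acks = []


class BusTransceiver:
    def __init__(self, address, processor, bus):
        self.address = address
        self.processor = processor
        self.bus = bus
        self.queue = deque()
        bus.attach(self)

    def enqueue(self, dest, payload):
        """Called by the local processor (over the local bus) to queue a message."""
        self.queue.append((dest, payload))

    def service(self):
        """Try to send the oldest queued message; return True if it was sent."""
        if not self.queue or not self.bus.acquire(self):
            return False  # nothing to send, or another transceiver is bus master
        try:
            dest, payload = self.queue.popleft()
            self.bus.transceivers[dest].receive(payload)
        finally:
            self.bus.release(self)
        # Report success back to the originating processor.
        self.processor.acks.append((dest, payload))
        return True

    def receive(self, payload):
        """Forward a message arriving from the system bus to the local processor."""
        self.processor.inbox.append(payload)
```

In this model, a transceiver that acquires the bus and never calls release prevents every other transceiver's service call from succeeding, which corresponds to the bus remaining under the control of a single bus master.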
Problems can arise when one of the processors 102a–c, or any other device (such as a microcontroller) that communicates over the communications bus 108 through one of the transceivers 106a–c, malfunctions or otherwise becomes unable to communicate over the communications bus 108. In particular, if one of the processors 102a–c fails while the corresponding one of the bus transceivers 106a–c is the bus master and is waiting for additional data from the failed processor, the transceiver may retain control of the bus 108 indefinitely, thereby making it impossible for other processors to communicate over the bus 108. This is referred to as “hanging” the bus 108.
One technique that has been used to address this problem is to couple watchdog timers 110a–c between the processors 102a–c and the corresponding bus transceivers 106a–c. In general, each of the watchdog timers 110a–c transmits an interrupt signal to the corresponding bus transceiver if the corresponding processor has been inactive for more than a predetermined threshold period of time. Although the watchdog timers 110a–c may be implemented in many ways, in one implementation the watchdog timers 110a–c are timers that are initialized to a zero value and that are incremented each clock cycle. Each of the processors 102a–c periodically resets its corresponding watchdog timer to zero. The frequency at which the processors 102a–c reset the watchdog timers 110a–c is chosen so that the value of each of the watchdog timers 110a–c will never reach a particular predetermined threshold value if the corresponding processor is behaving normally. If the value of a particular one of the watchdog timers 110a–c reaches the predetermined threshold value, then it is likely that the corresponding processor has crashed or is otherwise malfunctioning. In that event, the watchdog timer generates an interrupt signal to the corresponding bus transceiver, causing the bus transceiver to release control of the communications bus 108, and thereby preventing the bus from hanging.
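By way of illustration only, the counter-based watchdog implementation described above may be sketched as follows. The names used here (WatchdogTimer, BusTransceiver, simulate) and the parameter values are hypothetical and chosen solely for the example; the clock is modeled as a simple loop, and the interrupt signal is modeled as a callback that causes the transceiver to relinquish bus mastership.

```python
class WatchdogTimer:
    """Counter that starts at zero and is incremented once per clock cycle."""

    def __init__(self, threshold, on_timeout):
        self.threshold = threshold      # predetermined threshold value
        self.on_timeout = on_timeout    # interrupt line to the bus transceiver
        self.count = 0
        self.fired = False

    def reset(self):
        """Called periodically by a normally operating processor."""
        self.count = 0
        self.fired = False

    def tick(self):
        """Advance one clock cycle; raise the interrupt once at the threshold."""
        if self.fired:
            return
        self.count += 1
        if self.count >= self.threshold:
            self.fired = True
            self.on_timeout()


class BusTransceiver:
    """Hypothetical transceiver that releases the bus when interrupted."""

    def __init__(self):
        self.is_bus_master = False

    def interrupt(self):
        self.is_bus_master = False


def simulate(threshold, reset_period, fail_at, total_cycles):
    """Return the cycle at which the bus is released, or None if it never is.

    The processor resets the watchdog every `reset_period` cycles until it
    fails at cycle `fail_at` (None means the processor never fails).
    """
    transceiver = BusTransceiver()
    transceiver.is_bus_master = True
    watchdog = WatchdogTimer(threshold, transceiver.interrupt)
    for cycle in range(1, total_cycles + 1):
        processor_alive = fail_at is None or cycle < fail_at
        if processor_alive and cycle % reset_period == 0:
            watchdog.reset()
        watchdog.tick()
        if not transceiver.is_bus_master:
            return cycle
    return None
```

With a threshold of 10 cycles and a reset every 4 cycles, the counter in this model never exceeds 4 while the processor is operating normally, so no interrupt is generated. If the processor stops resetting the timer, the counter reaches the threshold a bounded number of cycles later and the transceiver releases the bus.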
This approach does not, however, eliminate all problems that may arise when one of the processors 102a–c fails. Consider an example in which processor 102a has failed. If processor 102b attempts to transmit a message to processor 102a over the communications bus 108, the message will fail to be transmitted successfully because processor 102a is in a failed state.
What is needed, therefore, are improved techniques for detecting and responding to the failure of a device coupled to a communications bus.