Debugging the hardware can be both a complex and a creative task in the development of a computer system. The task is particularly challenging for asynchronous multiprocessor systems because it is difficult to contain the erroneous data caused by a defect and to obtain repeatable state transitions into an error state. When a defect is present, we assume that it introduces a fault which can be observed by inducing the system to enter an anomalous error state.
In debugging a multiprocessor system, if a fault has occurred in one of the processors, it is desirable to halt the other processors so that the erroneous data is not spread by the communication network. It is further desirable to be able to induce the system to enter the error state repeatedly, so that other debugging means can be used to locate the source of the defect.
The Monsoon multiprocessor system, described in Related Invention No. 1, is an example of multiple synchronous subsystems interconnected by an asynchronous network.
Related Invention No. 2 discloses a synchronization (sync) network interconnecting Monsoon processors for halting their operation substantially simultaneously in order to contain any erroneous data.
A purpose of the present invention is to provide debugging methodologies which provide repeatability in the state transitions. Obtaining repeatability in the state transitions of a multiprocessor system is difficult because differences in clock edges and uncertainty in network propagation delays prevent the same inputs from arriving at a state in the same order. If there were a way to let the system take small `steps`, each of which is by itself a repeatable state transition, we could guarantee both the arrival and the order of arrival of inputs to the same states.
Once the same inputs reach the same states every time, initializing the system with the same set of states yields repeatable state transitions on every run. With repeatability, a malfunction can be reproduced at will and then located using other debugging methods.
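The effect of stepping on arrival order can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the mechanism of the invention: the message list, the random network delays, and the policy of ending a step only after every in-flight message has arrived and then applying the messages in a fixed sender order are all hypothetical choices made for the example.

```python
import random

def free_running(messages, seed):
    """Deliver messages in order of a random network delay (models asynchrony)."""
    rng = random.Random(seed)
    delayed = [(rng.random(), sender, payload) for sender, payload in messages]
    return [(sender, payload) for _, sender, payload in sorted(delayed)]

def stepped(messages, seed):
    """Same random delays, but the step ends only once all messages have
    arrived; they are then applied in a fixed order (by sender id)."""
    arrived = free_running(messages, seed)
    return sorted(arrived)

def consume(state, deliveries):
    """Order-sensitive state update: a different arrival order generally
    gives a different next state."""
    for _sender, payload in deliveries:
        state = (state * 31 + payload) % 1009
    return state

# Hypothetical traffic: (sender id, payload) from three processors.
MESSAGES = [(0, 5), (1, 7), (2, 11)]

# Next state after one step, for 20 different delay patterns.
stepped_states = {consume(1, stepped(MESSAGES, seed)) for seed in range(20)}
free_states = {consume(1, free_running(MESSAGES, seed)) for seed in range(20)}
```

In this sketch `stepped_states` always contains a single value, because the delay pattern no longer influences the order in which inputs are applied, while `free_states` may contain several.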
After a system is designed and built, the user faces the difficult task of debugging it. Debugging a system means finding all the defects that can cause the system not to function as planned. If we restrict our attention to synchronous systems and interconnections of synchronous systems, a malfunction is expressed as an incorrect state transition in the normal path of state transitions.
Assuming we are examining a defect of the type mentioned above, how do we know whether a particular system has a defect? One method is to analyze the state transitions. A fault that is caused by the defect and results in a malfunction will be expressed as an anomalous state transition, one which does not appear among the normal state transitions.
A synchronous system without any defects will always traverse the same fixed trajectory of state transitions as long as the initial states and inputs are the same. This trajectory may change when a defect is present, because the defect may cause a fault which induces the system to enter an error state. By analyzing all the state variables (i.e., all the bits that reside in storage elements) which compose every state, we can detect whether a system is in an error state.
For a large system, examining every state-variable bit on each clock cycle becomes inefficient and cumbersome, so another method is to provide error detection circuitry for the fault to trigger. A parity error checker, for example, can detect whether a set of state variables is confined to an allowable subset. If the fault cannot be detected using either of these two methods, i.e., we do not have observability of the fault, then we assume that we cannot detect the defect.
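The comparison of state trajectories can be sketched as follows. This is a minimal illustration, assuming a toy 4-bit counter as the synchronous system; the functions `good_counter` and `faulty_counter` and the stuck-at-zero defect are hypothetical and are not taken from the systems described here.

```python
def run_trajectory(step, initial_state, inputs):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [initial_state]
    state = initial_state
    for value in inputs:
        state = step(state, value)
        trajectory.append(state)
    return trajectory

def first_divergence(reference, observed):
    """Index of the first state that differs from the reference, or None."""
    for cycle, (expected, actual) in enumerate(zip(reference, observed)):
        if expected != actual:
            return cycle
    return None

def good_counter(state, value):
    """Defect-free 4-bit accumulator."""
    return (state + value) % 16

def faulty_counter(state, value):
    """Same accumulator with a hypothetical defect: bit 2 is stuck at zero."""
    return ((state + value) % 16) & ~0b0100

INPUTS = [1, 1, 1, 1, 1]
reference = run_trajectory(good_counter, 0, INPUTS)   # [0, 1, 2, 3, 4, 5]
observed = run_trajectory(faulty_counter, 0, INPUTS)  # [0, 1, 2, 3, 0, 1]
error_cycle = first_divergence(reference, observed)   # 4
```

The defect stays hidden until the trajectory reaches a state that exercises the stuck bit, which is why the same initial state and inputs are needed to reach it again.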
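A parity check of this kind can be sketched in a few lines. The example assumes even parity over a single stored word; the helper names `parity_bit`, `store`, and `check` are hypothetical. It also shows the limit of observability: a fault that flips two bits leaves the word inside the allowable subset and goes undetected.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2

def store(word: int):
    """Model a storage element that keeps a parity bit next to the data."""
    return word, parity_bit(word)

def check(word: int, stored_parity: int) -> bool:
    """True if the word still belongs to the allowable (correct-parity) subset."""
    return parity_bit(word) == stored_parity

data, parity = store(0b1011)

single_flip = data ^ 0b0001   # one bit corrupted by a fault
double_flip = data ^ 0b0011   # two bits corrupted by a fault

detected_single = not check(single_flip, parity)   # True: fault is observable
detected_double = not check(double_flip, parity)   # False: fault escapes parity
```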
Once we have discovered a malfunction, we would like to regenerate the transition into the error state many times, so we can isolate the fault by probing internal signals. To regenerate this error condition, we must have the system follow the same trajectory of state transitions as it did when the fault was first discovered. To do that, we need to have controllability in establishing the same initial states and the same inputs during every state change, so that the system will take the same steps as it did before.
If we can arrive at the error state every time, then we have determinism in repeating the fault condition. If the fault condition does not recur even though we have established the same initial states and inputs, then we have a transient defect, which requires additional debugging effort. Once we have determinism in reproducing the fault, various known debugging methods can be applied to the system to find the exact location of the fault. This process of setting the initial states and inputs, traversing the same state trajectory to arrive at the error state every time, and analyzing the signals is known as deterministic debugging.
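The distinction between a deterministic fault and a transient defect can be expressed as a simple replay loop. This is a sketch only: `run_once` is a hypothetical callable that re-initializes the system with the same states, applies the same inputs, and reports whether the error state was reached, and the number of attempts is an arbitrary choice.

```python
def classify_fault(run_once, attempts=5):
    """Replay the same initial states and inputs several times and classify
    the fault by how often the error state is reached."""
    outcomes = [run_once() for _ in range(attempts)]
    if all(outcomes):
        return "deterministic"    # error state reached on every replay
    if any(outcomes):
        return "transient"        # reached on some replays only
    return "not reproduced"       # never reached again

def always_fails():
    """Hypothetical system whose fault shows up on every replay."""
    return True

def make_intermittent(pattern):
    """Hypothetical system whose fault shows up only on some replays."""
    results = iter(pattern)
    return lambda: next(results)

deterministic = classify_fault(always_fails)
transient = classify_fault(make_intermittent([True, False, True, False, False]))
```

Only the first case permits probing internal signals on repeated runs with the expectation of seeing the same behavior each time.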