There are a number of applications in which it is required to roll back the state of a computer system. For example, fault-tolerant applications require a fast recovery from failure. Debugging software requires such a roll back but does not require the roll back to be on-line. However, there is a need to keep the debugging non-intrusive.
Other applications requiring the state of a computer system to be rolled back include (1) implementing atomic transactions in databases and (2) implementing an "UNDO" facility at the programming language level.
Debugging parallel computer programs is difficult for a number of reasons. First, it is difficult to understand program behavior since there are multiple threads of control active at any given instant. Second, because the scheduling of control between these multiple threads is non-deterministic, the behavior of a parallel program across successive executions may differ. Third, simple methods of monitoring programs by introducing monitoring statements are inadequate because they may perturb the behavior of the program they intend to monitor and thereby mask or unmask timing errors inherent in the program. This third point is generally referred to as the Heisenbug Uncertainty Principle and is often stated as follows:
Measuring the program state perturbs the state, and the degree of perturbation is proportional to the fraction of the state that is captured; i.e., the greater the volume of information collected, the lower the accuracy.
The conditions under which a machine instruction i is interfering, as defined later in this section, are:
(1) i is a temporary instruction (introduced solely for debugging or testing purposes) within the target program and will be removed before the final version of the program is released.
(2) i is a part of a debugger and is executed on the same processor as the genuine instructions (those contributing to the computation) of the target program.
(3) i is a part of the debugger and is not executed on the same processor as the genuine instructions of the target program, but i consumes a non-zero number of bus cycles from the target processor.
The problems with debugging parallel programs have been known for some time. However, little has been done to solve them, largely because a complete solution requires a coordinated and concerted effort that involves both hardware and software. Nearly all of the approaches developed to date are based principally on software and suffer from the Heisenbug Uncertainty Principle. All such approaches are therefore fundamentally weak in that they do not achieve zero perturbation and may fail to detect certain severe time-critical errors.
The Bell System 1-A was one of the first full-scale non-interfering debuggers. The System 1-A was developed by Bell Laboratories for their electronic switching system application. Non-interference was achieved through circuit duplication. The Bell System 1-A provided debugging capability at varying degrees of interference (interfering, moderately interfering and non-interfering). The hardware monitoring was achieved by a programmable device called the UTC (Utility Test Console). The UTC consisted of circuitry to detect the occurrence of an address, a data value, or any arbitrary bit pattern within the data. Each of these low-level events triggered an action, which could be to start or stop a trace or to interrupt the computer.
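The event-to-action behavior described for the UTC can be illustrated in software. The following is only a minimal sketch under stated assumptions: the names `Trigger`, `Action`, and `run_monitor`, the mask-based matching, and the representation of bus activity as (address, data) pairs are all hypothetical and are not taken from the System 1-A. The actual UTC was hardware circuitry; this model shows the logic only and is itself not non-intrusive.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    START_TRACE = "start_trace"
    STOP_TRACE = "stop_trace"
    INTERRUPT = "interrupt"


@dataclass(frozen=True)
class Trigger:
    """Hypothetical trigger: fires when the masked field equals the masked pattern."""
    field: str      # "address" or "data"
    pattern: int    # bit pattern to look for
    mask: int       # which bits of the field take part in the comparison
    action: Action

    def matches(self, address: int, data: int) -> bool:
        observed = address if self.field == "address" else data
        return (observed & self.mask) == (self.pattern & self.mask)


def run_monitor(bus_cycles, triggers):
    """Scan (address, data) pairs; return the recorded trace and interrupt positions."""
    tracing = False
    trace = []
    interrupts = []
    for index, (address, data) in enumerate(bus_cycles):
        for trig in triggers:
            if trig.matches(address, data):
                if trig.action is Action.START_TRACE:
                    tracing = True
                elif trig.action is Action.STOP_TRACE:
                    tracing = False
                else:
                    interrupts.append(index)
        if tracing:
            trace.append((address, data))
    return trace, interrupts


# Assumed example bus activity, invented for illustration.
cycles = [(0x100, 0x00), (0x200, 0x0F), (0x204, 0xA5), (0x300, 0x01), (0x304, 0x02)]
triggers = [
    Trigger("address", 0x200, 0xFFFF, Action.START_TRACE),  # exact address match
    Trigger("address", 0x300, 0xFFFF, Action.STOP_TRACE),   # exact address match
    Trigger("data", 0xA0, 0xF0, Action.INTERRUPT),          # upper nibble of data is 0xA
]
trace, interrupts = run_monitor(cycles, triggers)
```

In this sketch the trace covers the cycles from the start address up to, but not including, the stop address, and the bit-pattern trigger records an interrupt request at the cycle whose data has 0xA in its upper nibble.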
The System 1-A was a pioneering effort in non-intrusive debugging. However, its main drawback was that it was a single-phase debugger and did not allow the programmer to replay the program. To identify an error, the programmer had to peruse voluminous monitoring data. Since the programmer did not get an opportunity to replay the program, he or she had to plan carefully what to monitor the first time around. In a sense, therefore, it required the programmer to know the nature of the error before the start of the debugging session. Other approaches suffered from similar problems.
Significant among the other approaches are the replay-based techniques disclosed in T. J. LeBlanc and John M. Mellor-Crummey, "Debugging Parallel Programs with Instant Replay," IEEE Transactions on Computers, pages 471-482, Apr. 1987, and Chiu's debugger for an atomic transaction system, described in Y. S. Chiu, "Debugging Distributed Computations in a Nested Atomic Action System", PhD thesis, Laboratory for Computer Science, MIT, 1984. All these approaches, however, interfere with the target program. Other approaches involve reversible execution systems in which the programmer is allowed to roll back the program to a previous state. Again, these approaches interfere with the target program.
In summary, the literature is scarce when it comes to debugging real-time programs (where it is important not only to compute the right value but also to do so within strict time constraints). Certain errors in real-time programs do not occur in non-real-time programs.
A real-time application is one in which the correctness of a program is defined not solely by the value of the output that is computed, but also by the time at which that value is computed. A correct value that is not computed within a certain time limit is considered erroneous. In certain applications, meeting deadlines is so important that an incorrect value computed within the deadline is acceptable while a correct computation that misses the deadline is not.
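The distinction between value correctness and real-time correctness can be made concrete with a small sketch. The names `Result`, `is_correct_non_real_time`, `is_correct_real_time`, and `is_acceptable_deadline_first`, as well as the millisecond units and sample numbers, are assumptions made for illustration and do not come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    value: int             # the computed output
    finish_time_ms: float  # when the output became available (assumed unit)


def is_correct_non_real_time(result: Result, expected_value: int) -> bool:
    """Non-real-time notion of correctness: only the value matters."""
    return result.value == expected_value


def is_correct_real_time(result: Result, expected_value: int, deadline_ms: float) -> bool:
    """Real-time notion: the value must be right and produced by the deadline."""
    return result.value == expected_value and result.finish_time_ms <= deadline_ms


def is_acceptable_deadline_first(result: Result, deadline_ms: float) -> bool:
    """Stricter class of applications: meeting the deadline outweighs the value."""
    return result.finish_time_ms <= deadline_ms


late_but_right = Result(value=42, finish_time_ms=12.0)
on_time_but_wrong = Result(value=41, finish_time_ms=8.0)
```

With an expected value of 42 and a deadline of 10 ms, `late_but_right` passes the non-real-time check but fails the real-time one, while `on_time_but_wrong` is acceptable only under the deadline-first criterion.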
Since meeting deadlines is so crucial in real-time systems, it is necessary to schedule the processes appropriately to meet the timing constraints. Scheduling, therefore, is an integral part of a real-time program. An incorrect schedule could lead to a missed deadline.
Scheduling errors are typical of real-time programs, where missing a deadline is perceived as an error. These errors are also not reproducible, but the reasons for their irreproducibility are not limited to the relative progress of the constituent processes. The following are some of the situations in which a scheduling error may occur:
(1) Incorrect Scheduling: Usually, in time-critical programs where meeting the deadlines is of utmost importance, the program manages the scheduling of control between its processes. An incorrect task schedule, for example, could result in a deadline being missed. Even though the program explicitly manages the scheduling, the exact schedule need not be the same during each re-execution because of asynchronous events such as interrupts. As a result, the deadline could be missed on some executions and not on others.
(2) Unpredictable Timing: Sometimes, however, a deadline may be missed even though the schedule was correct. For example, a deadline may be missed in spite of a correct task and instruction schedule, because an instruction or a set of instructions took more time than anticipated.
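Both situations can be reproduced in a toy model. The sketch below assumes a deliberately simple execution model that is not described in the text: tasks run back to back, without preemption, in the order given, and an asynchronous interrupt is modeled as a fixed delay before the first task. The function name `missed_deadlines`, the task names, and all durations and deadlines are hypothetical.

```python
def missed_deadlines(schedule, actual_durations, deadlines, interrupt_delay=0.0):
    """Return the tasks that finish after their deadline.

    schedule         -- task names in execution order
    actual_durations -- time each task really took on this run
    deadlines        -- absolute deadline of each task
    interrupt_delay  -- time lost to an interrupt before the first task (assumed model)
    """
    clock = interrupt_delay
    missed = []
    for task in schedule:
        clock += actual_durations[task]
        if clock > deadlines[task]:
            missed.append(task)
    return missed


# Assumed example numbers, in arbitrary time units.
anticipated = {"sample": 2.0, "control": 3.0, "log": 1.0}
deadlines = {"sample": 3.0, "control": 6.0, "log": 10.0}
good_order = ["sample", "control", "log"]
bad_order = ["control", "sample", "log"]

# Correct schedule and anticipated timing: no deadline is missed.
baseline = missed_deadlines(good_order, anticipated, deadlines)

# (1) Incorrect schedule: "sample" runs too late.
wrong_schedule = missed_deadlines(bad_order, anticipated, deadlines)

# (1) Same correct schedule, but an interrupt delays this particular run.
with_interrupt = missed_deadlines(good_order, anticipated, deadlines, interrupt_delay=1.5)

# (2) Correct schedule, but "control" takes longer than anticipated.
slow_run = dict(anticipated, control=4.5)
unpredictable = missed_deadlines(good_order, slow_run, deadlines)
```

The same program and the same nominal schedule produce a miss on one run (`with_interrupt`) and none on another (`baseline`), which is the irreproducibility described above.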
A machine instruction i, executed during a particular run of the target program P, is said to be interfering or intrusive with the execution of P if any of the three conditions on i listed above, after the statement of the Heisenbug Uncertainty Principle, is satisfied.
The bus-cycle condition emphasizes the fact that having an auxiliary processor on which the debugger runs does not, in itself, guarantee non-intrusiveness. For example, if the auxiliary processor shares the same memory as the target processor, then instructions executed on it will steal some cycles from the target processor to access the memory and will therefore contribute to the interference.
According to the above definition, an instruction i left permanently within the target program is not considered as interfering.
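The definition of an interfering instruction can be read as a predicate over three conditions: a temporary instruction in the target program, a debugger instruction on the target processor, and a debugger instruction on another processor that still consumes target bus cycles. The following is a minimal sketch of that reading; the `Instruction` record, its field names, and the function `is_interfering` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    temporary: bool            # added only for debugging/testing, removed before release
    part_of_debugger: bool     # belongs to the debugger rather than the target program
    on_target_processor: bool  # executes on the processor running the target program
    target_bus_cycles: int     # bus cycles it consumes from the target processor


def is_interfering(i: Instruction) -> bool:
    """True if any of the three interference conditions holds for instruction i."""
    temporary_in_target = i.temporary
    debugger_on_same_cpu = i.part_of_debugger and i.on_target_processor
    debugger_steals_bus = (
        i.part_of_debugger
        and not i.on_target_processor
        and i.target_bus_cycles > 0
    )
    return temporary_in_target or debugger_on_same_cpu or debugger_steals_bus


# A monitoring statement left permanently in the target program.
permanent_probe = Instruction(temporary=False, part_of_debugger=False,
                              on_target_processor=True, target_bus_cycles=1)

# A debugger instruction on an auxiliary processor that shares memory with the target.
shared_memory_debugger = Instruction(temporary=False, part_of_debugger=True,
                                     on_target_processor=False, target_bus_cycles=2)

# A debugger instruction on an auxiliary processor that uses no target bus cycles.
isolated_debugger = Instruction(temporary=False, part_of_debugger=True,
                                on_target_processor=False, target_bus_cycles=0)
```

Under this reading, `permanent_probe` is not interfering, `shared_memory_debugger` is interfering because of the bus-cycle condition, and `isolated_debugger` is not.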