1. Field of the Invention
The present invention relates to multi-processor systems. More specifically, the present invention relates to a method and an apparatus for minimizing perturbations while monitoring application software or programs running on a multi-processor system.
2. Art Background
An application, particularly a complex application, can be optimized by monitoring its performance and improving the parts of the application that display an unsatisfactory performance. One of the categories of monitoring that can be performed on an application is event tracing. In event tracing, data is captured in time stamped records when events occur. It is not uncommon for a running program to generate hundreds or thousands of such events in a very short period of time. For large scale parallel applications, these event traces must be captured for each processor utilized by the application. The high frequency of event trace generation per processor, coupled with the potential of using a very large number of processors, results in extremely large quantities of event trace data. These event traces can quickly grow too large to keep in the processor memory and must be saved on some secondary storage medium ("secondary memory"), e.g. a hard disk.
Transmission of event trace data to the secondary memory creates a problem. If the event trace data is transmitted on the system interconnect facilities used for program-related communication, it is not uncommon for the interconnect utilization of event tracing to exceed the interconnect utilization of the program being studied. As a result, event tracing can significantly perturb the very application behavior being monitored. This outcome is unacceptable in that the process of monitoring is self-defeating, resulting in a Heisenberg-like uncertainty in the experiment.
The prior art method of transferring performance monitoring data to a secondary memory failed to effectively minimize the perturbation, causing the application behavior to be significantly different from what it would have been without the monitoring.
The prior art method will now be described with reference to FIGS. 1 and 2. FIG. 1 shows a simplified schematic of a multiprocessor system with a plurality of nodes. The multiprocessor system 100 of FIG. 1 includes a plurality of nodes, such as nodes 105, 110, 115, and 120. Each node has one or more processors and is connected to one or more other nodes via high speed interconnects, such as interconnects 125, 130, and 135. FIG. 2 illustrates a node of the multiprocessor system of FIG. 1. Each node contains one processor, such as CPU 200 (or a plurality of processors), connected via a processor memory bus 210 to a corresponding memory 205.
According to the prior art method, when the performance monitoring data buffer on a node, e.g., some part of memory 205 on node 105, is filled, the running of the application is stopped on node 105 and the performance monitoring data is transferred to the secondary memory (not shown). Meanwhile the other nodes, nodes 110, 115, 120, etc. in the system continue to run the application. Once the data has been transferred to the secondary memory, node 105 resumes running the application. This prior art method causes two significant problems.
First, it results in the Heisenberg-like uncertainty referred to earlier. For example, with reference to FIG. 1, node 105 transfers performance monitoring data on link 125 while the other nodes send application related data on link 125. While node 105 occupies link 125 with monitoring data, the other nodes in the system cannot use link 125. Assume the period of time it takes node 105 to make this transfer is tp. If the application were not being monitored, time tp could be used by the application to transfer data on link 125. When the application is monitored, the use of link 125 for time tp to transfer monitoring data causes some message passing events on any node attempting to use link 125 to be delayed by time tp. The Heisenberg-like uncertainty arises because there is no way to factor tp out of the resulting event trace. A programmer looking at the event trace will incorrectly conclude that the application kept link 125 busy 100% of the time, when in fact an additional time tp is available to the application.
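The effect of tp on link 125 can be illustrated with a minimal sketch. It assumes a single link that serves requests first come, first served, and all times, durations, and function names are hypothetical values chosen for illustration; none of them come from the system described here.

```python
# Minimal sketch (assumed FIFO link model, hypothetical numbers).
# Each request is (ready_time, duration, kind), in seconds.

def simulate_link(requests):
    """Serve requests in order of ready time on one shared link."""
    free_at = 0.0
    schedule = []
    for ready, duration, kind in sorted(requests):
        start = max(ready, free_at)   # wait if the link is still busy
        end = start + duration
        schedule.append({"kind": kind, "ready": ready, "start": start, "end": end})
        free_at = end
    return schedule

def busy_fraction(schedule, window):
    """Fraction of the observation window during which the link carried data."""
    return sum(s["end"] - s["start"] for s in schedule) / window

T_P = 3.0  # assumed time for node 105 to move its monitoring data
app_only = [(0.0, 4.0, "app"), (4.0, 3.0, "app")]
with_trace = app_only + [(3.5, T_P, "trace")]

unmonitored = simulate_link(app_only)
monitored = simulate_link(with_trace)

# The last application message is held back while the trace data uses the link.
delay = monitored[-1]["start"] - monitored[-1]["ready"]
print(delay)                            # 3.0
print(busy_fraction(unmonitored, 10.0)) # 0.7
print(busy_fraction(monitored, 10.0))   # 1.0
```

In this sketch the monitored run shows the link fully occupied over the 10 second window, while the same application traffic without monitoring leaves 3 seconds idle. The trace alone does not reveal which part of the occupancy is due to monitoring.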
The second problem caused by the prior art method is the skewing in event tracing. Because node 105 stops running the application while other nodes continue to run the application, the operations on node 105 are delayed by a period of time equal to the time period during which the running of the application is halted on node 105. Events on node 105 occur later than they would have without monitoring. Thus, the event tracing gives a chronology of events that is different from what it would have been without the monitoring. According to the event tracing data, events on node 105 occur later than some other events on other nodes. However, were it not for the performance monitoring, those same events on node 105 would have occurred before the aforementioned other events.
It is important to note that a simple shift in time for all events on all nodes would not be of concern so long as the relative time between events is the same as it would have been without tracing (e.g., events x and y occur at 10 and 12 seconds respectively rather than at 5 and 7 seconds, as would have been the case without monitoring). However, the problem caused by the prior art is far more complicated. Specifically, prior art techniques change the relative timing of events (e.g., in the above example, event y would occur before event x). The problem is further aggravated as the initial delay is propagated throughout the system: other operations, which depend on the initially delayed operation, are delayed as well by varying amounts of time. The delay is further compounded by delays caused by the unloading of performance monitoring data from other nodes to the secondary memory. Eventually, as more nodes transfer event trace data to the secondary memory, the event ordering can be significantly skewed. Thus, the application behavior depicted by the resulting event trace bears very little, if any, relation to the chronology of events without monitoring. Again, skewing is not merely a shifting of all the time stamps by the same amount of time, but a nearly intractable reordering of the events. Therefore, event tracing that accurately corresponds to the chronology of events without monitoring becomes practically impossible.
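The difference between a uniform shift and a change in relative timing can be shown with a short sketch. The event times of 5 and 7 seconds and the 5 second shift follow the example above; the assignment of events to nodes, the stall start, and the stall length are hypothetical assumptions added for illustration.

```python
# Minimal sketch: uniform shift versus a stall on a single node.
# Event times without monitoring (seconds); node assignment is assumed.
baseline = {"x": ("node105", 5.0), "y": ("node110", 7.0)}

def uniform_shift(events, delay):
    """Delay every event on every node by the same amount."""
    return {name: (node, t + delay) for name, (node, t) in events.items()}

def single_node_stall(events, stalled_node, stall_start, stall_len):
    """Delay only events on stalled_node that occur at or after stall_start."""
    out = {}
    for name, (node, t) in events.items():
        if node == stalled_node and t >= stall_start:
            t += stall_len
        out[name] = (node, t)
    return out

def order(events):
    """Event names sorted by time stamp."""
    return [name for name, _ in sorted(events.items(), key=lambda kv: kv[1][1])]

shifted = uniform_shift(baseline, 5.0)          # x at 10 s, y at 12 s
stalled = single_node_stall(baseline, "node105",
                            stall_start=4.0, stall_len=3.0)  # x at 8 s, y at 7 s

print(order(baseline))  # ['x', 'y']
print(order(shifted))   # ['x', 'y']
print(order(stalled))   # ['y', 'x']
```

The uniform shift preserves the order of x and y, whereas stalling only node 105 makes y appear before x in the trace.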
The prior art avoided substantially skewing event tracing by only monitoring the application for very short periods of time or by coarse grain sampling (every 1,000 events instead of every event). This minimized the amount of performance monitoring data and consequently the time needed to transfer the performance monitoring data to the secondary memory. However, it failed to allow fine grain monitoring of applications for a relatively long time while minimizing perturbations, and coarse grain sampling cannot always be used to determine application behavior.
The prior art provides two other common approaches to solving the problem of perturbing the application. The first approach involves the use of two inter-node communication facilities (ICFs), one for the application communications and a second for event trace data. This solution has two disadvantages. First, its cost is high: the hardware necessary for an ICF must be duplicated, doubling the cost of this part of the system. Second, there is no known secondary memory besides the main memory that can operate at high enough bandwidths to capture event trace data for a large parallel computer. Therefore, the application can only be monitored for relatively short periods of time without altering application behavior.
The second approach is to use a cross processor interrupt facility to start and stop the application in lock step. Once a node notices that it needs to move performance monitoring data to a secondary memory, it interrupts all other nodes on the system. Each node then moves its performance monitoring data to the secondary memory while the application is stopped. The same mechanism is used to restart the application. The primary advantage of this approach over the dedicated ICF is that relatively long periods of time can be monitored. This approach is also expensive to build for a large parallel system. Additionally, this type of facility is difficult to partition. If two separate applications are running, each on a portion of the processors, an ideal solution would only stop the processors running the application being monitored. It is not practical to build the hardware support needed to implement the ideal solution. Instead, with the practical hardware support, the cross processor interrupt facility would interrupt all nodes on the system, including those that are not running the application being monitored. This method, therefore, increases interference with other applications running on the system that are not being monitored. For the cross processor interrupt facility, if it is desired to partition the interrupt, it is difficult to know how many times to partition the system. Since a separate wire for each partition of the cross processor facility must be run, a guess that is too high would be expensive since it would cover situations that would never arise. Therefore, it is preferable to guess at a level that covers most situations. Of course, in this case, the system cannot monitor all applications.
A third prior art approach will now be described with reference to the flow chart of FIG. 3.
The mechanism for implementation is either an interrupting signal (for example, a UNIX signal) or an interrupting message (for example, an Intel Paragon hsend/hrecv). This mechanism allows one node to perform an action that causes the receiving node to stop execution of the code it is running, and branch off to a special handler. This handler can perform actions, and when complete can return to the code the node was running before the interrupt occurred.
As in the method described with reference to FIG. 9, the application is allowed to execute until some node's performance monitoring buffer reaches a threshold. When that occurs, an interrupting message or signal is sent to all other nodes.
To minimize perturbation, the nodes wait at a global synchronization point until all nodes have stopped. This is necessary because there is a large amount of skew in when the nodes receive their stop interrupt.
Once all nodes have stopped, buffers are flushed to secondary storage.
The nodes then wait again at a global synchronization point until all buffers have been written to disk.
When all nodes reach the synchronization point, the flush is complete and the application is restarted.
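The stop, flush, and restart sequence described above can be sketched as follows. This is a simplified illustration only, not the prior art implementation: threads stand in for nodes, a `threading.Event` stands in for the interrupting message or signal, a `threading.Barrier` stands in for the global synchronization, and a Python list stands in for secondary storage. The node count, the threshold, and all names are hypothetical.

```python
import threading

NUM_NODES = 4   # hypothetical number of nodes
THRESHOLD = 5   # hypothetical buffer threshold, in records

def run_flush_cycle():
    stop_requested = threading.Event()       # stand-in for the stop interrupt
    barrier = threading.Barrier(NUM_NODES)   # stand-in for global synchronization
    storage_lock = threading.Lock()
    secondary_storage = []                   # stand-in for disk
    produced = {}

    def node(node_id):
        buffer = []
        step = 0
        # Run the "application" until some node's buffer reaches the threshold.
        while not stop_requested.is_set():
            buffer.append((node_id, step))
            step += 1
            if len(buffer) >= THRESHOLD:
                stop_requested.set()         # tell every node to stop
        produced[node_id] = len(buffer)
        barrier.wait()                       # wait until all nodes have stopped
        with storage_lock:
            secondary_storage.extend(buffer) # flush buffer to secondary storage
        buffer.clear()
        barrier.wait()                       # wait until all flushes are complete
        # The application would be restarted here.

    threads = [threading.Thread(target=node, args=(i,)) for i in range(NUM_NODES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return produced, secondary_storage

produced, stored = run_flush_cycle()
print(len(stored) == sum(produced.values()))  # True: every buffered record was flushed
```

Each thread notices the stop request at a different point in its loop, so the per-node record counts generally differ from run to run. This loosely mirrors the skew in when the nodes receive their stop interrupt.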
However, this approach has several flaws.
First, since a software mechanism is used to stop the nodes, they do not stop at the same time. Since nodes stop out of sync, the chronological order of events is not preserved as some nodes stop immediately, while others continue to execute. This is a result of having no hardware support to interrupt multiple nodes as in both our solution and the crossbar switch solution.
Second, unless the system provides partitionable support for global synchronization, there is also skew as software messages propagate the signal that it is time to restart the application. Because some nodes start sooner than others, the chronological order of events is not preserved.
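The stop skew described in the first flaw can be quantified with a rough sketch. The per-node interrupt latencies and the event interval below are hypothetical values chosen only for illustration, and the model assumes that each node keeps generating events at a fixed rate until its stop interrupt arrives.

```python
# Minimal sketch: events recorded after the stop request, per node.
# Latencies (microseconds) from the stop request to each node actually stopping.
stop_latency_us = {"node105": 0, "node110": 200, "node115": 500, "node120": 900}
EVENT_INTERVAL_US = 100  # assumed time between events on a running node

def extra_events(latency_us, interval_us):
    """Events a node still records between the stop request and its own stop."""
    return latency_us // interval_us

extra = {node: extra_events(lat, EVENT_INTERVAL_US)
         for node, lat in stop_latency_us.items()}
skew_us = max(stop_latency_us.values()) - min(stop_latency_us.values())

print(extra)    # {'node105': 0, 'node110': 2, 'node115': 5, 'node120': 9}
print(skew_us)  # 900
```

Under these assumptions, the last node to stop records nine events that the initiating node never has the chance to match. The same reasoning applies to the restart skew, with the latencies taken as the delays in receiving the restart message.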