Message passing is an effective programming technique for exploiting coarse-grained concurrency on distributed computers, as the popularity of the Message Passing Interface (MPI) shows. Unfortunately, debugging message-passing applications can be difficult. Analysis tools for MPI applications produce trace files that can then be examined with a trace analyzer, a type of performance analysis tool. Within MPI processes, such tools record calls to the MPI library and the messages transmitted, and also allow arbitrary user-defined events to be recorded. Instrumentation can be switched on or off at runtime. While such tools can aid in detecting errors, current correctness checking tools cannot adequately detect transmission and implementation problems for various operations, such as reduce operations.
Hardware, driver, and system software problems can introduce bit errors into data transmitted between processes in a parallel application, or can lead to truncated transmissions. Traditionally, checksums are used to detect such errors, and error correction codes help to reconstruct the original data. These checks can be performed at any level of a communication stack, or added on top of it at the application level. Parallel reduce operations differ from verbatim transmission of data in that they modify the data in some configurable and perhaps programmable way while the data is in transit. Consequently, a checksum computed over the data as sent no longer describes the data as received.
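The difference between verbatim transmission and a reduce operation can be illustrated with a minimal sketch. It does not use MPI: two plain Python lists stand in for the buffers of two processes, the reduction is assumed to be an element-wise sum, and the helper names (`pack`, `send_with_checksum`, `verify`) are hypothetical and chosen only for this illustration. CRC-32 from the standard `zlib` module serves as the checksum.

```python
import struct
import zlib


def pack(values):
    """Serialize a list of integers as little-endian 64-bit words."""
    return struct.pack(f"<{len(values)}q", *values)


def send_with_checksum(values):
    """Return the payload and the CRC-32 a sender could attach to it."""
    payload = pack(values)
    return payload, zlib.crc32(payload)


def verify(payload, checksum):
    """Receiver-side check: recompute the CRC-32 and compare."""
    return zlib.crc32(payload) == checksum


# Hypothetical contributions of two processes.
rank0 = [1, 2, 3]
rank1 = [4, 8, 16]

# Verbatim transmission: the receiver sees exactly what was sent,
# so the sender's checksum validates the payload.
payload0, crc0 = send_with_checksum(rank0)
verbatim_ok = verify(payload0, crc0)

# A single flipped bit in transit is caught by the same checksum.
corrupted = bytes([payload0[0] ^ 0x01]) + payload0[1:]
corruption_detected = not verify(corrupted, crc0)

# Reduce operation (assumed here: element-wise sum). The receiver sees
# data that no single sender ever held, so no sender-side checksum
# describes it.
_, crc1 = send_with_checksum(rank1)
reduced = [a + b for a, b in zip(rank0, rank1)]
reduced_payload = pack(reduced)
reduced_matches_a_sender = (
    verify(reduced_payload, crc0) or verify(reduced_payload, crc1)
)

print(verbatim_ok, corruption_detected, reduced_matches_a_sender)
```

In this sketch the checksum protects the verbatim transfer and exposes the flipped bit, but the reduced result matches neither sender's checksum, which is why checking a reduce operation needs a different approach than checking a plain transmission.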
In addition, communication between processes in parallel applications can cause deadlocks. These include actual deadlocks, which can be detected by traditional monitoring of application progress and/or timeouts, as well as potential deadlocks, which occur only on specific platforms or configurations and thus cannot be detected in this way. Accordingly, current correctness checking tools cannot adequately detect such potential deadlocks.
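A toy model can show why a potential deadlock escapes progress monitoring on one platform and appears on another. The sketch assumes a common pattern in which two processes each issue a blocking send to the other before posting a receive, and it assumes that such a send returns early only when the message fits into the implementation's internal buffering. The function name `exchange_deadlocks`, the parameter `eager_limit_bytes`, and the platform names and limits are hypothetical and serve only this illustration; they are not taken from any particular MPI implementation.

```python
def exchange_deadlocks(message_bytes, eager_limit_bytes):
    """Toy model: both processes call a blocking send, then a receive.

    Assumption: a send completes without a matching receive only if the
    message fits within eager_limit_bytes of internal buffering.
    """
    send_completes_locally = message_bytes <= eager_limit_bytes
    # If neither send can complete on its own, neither process ever
    # reaches its receive, and the exchange hangs.
    return not send_completes_locally


# Hypothetical buffering limits for two configurations.
platforms = {
    "platform_a": 64 * 1024,
    "platform_b": 1 * 1024,
}

message_bytes = 16 * 1024
results = {
    name: exchange_deadlocks(message_bytes, limit)
    for name, limit in platforms.items()
}

# The same program runs to completion on one configuration and hangs
# on the other, so a clean run on platform_a proves nothing.
print(results)
```

Under these assumptions the exchange completes on `platform_a` and hangs on `platform_b`. Monitoring progress or using timeouts on `platform_a` alone would therefore never reveal the problem.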