A massively parallel processor (MPP) is one type of supercomputer. An MPP consists of a large number of independent computing nodes (processors and memory) interconnected with a specialized high-speed network. The number of nodes in a supercomputer can be in the thousands. An application or task running on an MPP is divided into many subtasks, each of which executes on its own node. The subtasks execute in parallel, each subtask computing a portion of the final result. These individually computed results, in general, need to be combined multiple times during the execution of the overall application, with the combined intermediate result being sent back to each of the nodes running the subtasks of the application.
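By way of illustration only, the divide, compute, combine, and redistribute pattern described above can be sketched in a few lines of Python. In this sketch, worker threads stand in for the computing nodes and a simple sum stands in for the application's combining operation. Both are assumptions made for clarity, and the names `partial_sum` and `combine_and_distribute` are hypothetical. The sketch does not represent any particular MPP's network or hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    # Subtask: each simulated node computes its portion of the final result.
    return sum(chunk)


def combine_and_distribute(chunks):
    # One worker thread stands in for each node (illustrative assumption).
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine the individually computed results into one intermediate result.
    combined = sum(partials)
    # Send the combined result back to every node running a subtask.
    return [combined for _ in chunks]


data = list(range(1, 101))
chunks = [data[i::4] for i in range(4)]  # divide the task among 4 "nodes"
results = combine_and_distribute(chunks)
```

In a real application this combining step would, in general, be repeated many times during execution, with every node waiting on the combined result before proceeding.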
When the processes on the plurality of nodes in an MPP interact, the possibility of deadlock exists. Deadlock is a situation in which two or more processes are each waiting for a message from another of the processes, or for a related event to occur, but none receives the notification, and each simply continues to wait. Deadlock can result from programming errors. Deadlock may also result from the hardware implementation: occasionally, due to hardware conditions, messages or notifications may block each other and never be delivered to the waiting processes, which then end up deadlocked. In some computing environments, it may be acceptable to detect after the fact that deadlock has occurred and to correct the problem. This is not acceptable in a supercomputing MPP environment, where the number of interacting processes can be in the thousands. Even a very small possibility of deadlock can have a large impact on overall application performance.
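The mutual-wait condition described above can be illustrated with a minimal Python sketch. Two threads stand in for two processes, each of which waits for a message from the other before sending its own. The names `node` and `run_mutual_wait` are hypothetical, and the timeout is an assumption added solely so that the example terminates. Without the timeout, both simulated processes would wait indefinitely.

```python
import queue
import threading


def node(inbox, outbox, status, name, timeout):
    try:
        # Wait for the peer's message before sending anything.
        inbox.get(timeout=timeout)
        outbox.put(f"reply from {name}")
        status[name] = "received"
    except queue.Empty:
        # No message ever arrived: the peer is waiting on this node too.
        status[name] = "still waiting"


def run_mutual_wait(timeout=0.2):
    a_inbox, b_inbox = queue.Queue(), queue.Queue()
    status = {}
    threads = [
        threading.Thread(target=node, args=(a_inbox, b_inbox, status, "A", timeout)),
        threading.Thread(target=node, args=(b_inbox, a_inbox, status, "B", timeout)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


status = run_mutual_wait()
```

Neither simulated process sends before it receives, so each reports that it is still waiting when its timeout expires. The sketch shows only the programming-error form of deadlock and does not model the hardware-induced case.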
There remains a need in the art for an improved engine and method for performing deadlock avoidance in an MPP.