The present disclosure relates generally to facilitating input data fault recovery in massively parallel computing applications, and more specifically, to a system for preventing failure between processes of a parallel computing application.
Massively parallel computing applications typically require complicated communication between nodes, and that communication depends on input data distributed across many nodes. In real-time environments, such as a multi-mode radar system, there might be N antenna beams, each requiring computation at a single node, and M subswaths that generate M threads for computation at each node.
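By way of illustration only, the N-beam, M-subswath arrangement described above may be sketched as follows. The sketch simulates the nodes within a single process, and the names `process_subswath`, `run_node`, and `run_all` are hypothetical and do not appear in the disclosure. It assumes one node per antenna beam and one thread per subswath at each node, and it omits inter-node communication and fault recovery entirely.

```python
from concurrent.futures import ThreadPoolExecutor


def process_subswath(beam: int, subswath: int) -> tuple[int, int]:
    # Hypothetical placeholder for the per-subswath computation.
    return (beam, subswath)


def run_node(beam: int, num_subswaths: int) -> list[tuple[int, int]]:
    # Assumption: one node handles one beam and runs one thread per subswath.
    with ThreadPoolExecutor(max_workers=num_subswaths) as pool:
        return list(
            pool.map(lambda s: process_subswath(beam, s), range(num_subswaths))
        )


def run_all(num_beams: int, num_subswaths: int) -> dict[int, list[tuple[int, int]]]:
    # Nodes are simulated sequentially here; a real system would distribute
    # them across separate machines or processes.
    return {beam: run_node(beam, num_subswaths) for beam in range(num_beams)}
```

For example, `run_all(3, 4)` models N = 3 beams and M = 4 subswaths, giving 3 simulated nodes with 4 threads each, or 12 units of work in total.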