Massively parallel processing (MPP) systems have a large number of independent processors or nodes that are interconnected and execute in parallel to form a very large computer. MPP systems are designed to solve complex mathematical problems that are highly computationally intensive. Such problems can exploit the entire processing power of the system and typically take many days to complete.
Even though the individual nodes in an MPP system are designed to have a high Mean Time to Failure (MTTF), the reliability of the total system goes down significantly as the number of nodes goes up. For example, if the MTTF of a processor is 1000 days and node failures are independent, a system with a thousand such processors will have a net MTTF of roughly one day, because the failure of any single node brings down the workload.
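The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: nodes fail independently with a constant failure rate, and the system counts as failed as soon as any one node fails. The function name `system_mttf` and its parameters are hypothetical and are not taken from the text.

```python
def system_mttf(node_mttf_days: float, num_nodes: int) -> float:
    """Approximate MTTF (in days) of a system that fails when any node fails.

    Assumption: nodes fail independently with a constant failure rate of
    1 / node_mttf_days each. The system failure rate is then the sum of the
    node rates, num_nodes / node_mttf_days, and the system MTTF is its inverse.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if node_mttf_days <= 0:
        raise ValueError("node_mttf_days must be positive")
    return node_mttf_days / num_nodes


if __name__ == "__main__":
    # The example from the text: 1000 processors, each with an MTTF of 1000 days.
    print(system_mttf(1000.0, 1000))  # 1.0
```

Under the same assumptions, doubling the node count halves the system MTTF, which is why reliability degrades so quickly as systems grow.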
As a result, modern supercomputers with thousands of compute nodes encounter frequent crashes that severely impact the workload completion time.