1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatuses, and computer program products for managing workload distribution among a plurality of compute nodes.
2. Description of Related Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices. Today's computers are much more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components, application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
Modern computing systems can be massively parallel and include many compute nodes within a computing system. To distribute workload assignments among the compute nodes, a system may utilize a distribution controller. The ability of the distribution controller to properly distribute workload assignments may be hampered by errors within a compute node during execution or consumption of one or more workload assignments. For example, a compute node that is operating in error may provide information to the distribution controller indicating that workload assignments have been completed. However, because of the error within the compute node, the workload assignments are ‘completed’ quickly, and thus the distribution controller distributes more workload assignments to the error-generating compute node. This problem, known as the “Storm Drain Problem,” can be especially hard to correct when the feedback from the compute node to the distribution controller does not explicitly indicate that the compute node is operating in error. The method by which the distribution controller chooses to distribute the workload assignments in this situation can significantly affect both the overall effect seen by users of the parallel computer and the efficiency of the compute nodes within the parallel computer.
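By way of illustration only, the Storm Drain Problem described above may be sketched as a small simulation. The sketch assumes a distribution controller that hands each workload assignment to whichever compute node reports itself free earliest; this policy, the function name simulate, the node names, and the service times are hypothetical and are not taken from any particular system.

```python
import heapq


def simulate(num_tasks, service_times):
    """Assign each task to the node that reports itself free earliest.

    service_times maps a (hypothetical) node name to the time that node
    takes to report an assignment as completed. Returns a dict mapping
    each node name to the number of assignments it received.
    """
    # Min-heap of (time the node becomes free, node name); all start idle.
    free_at = [(0, name) for name in sorted(service_times)]
    heapq.heapify(free_at)
    received = {name: 0 for name in service_times}
    for _ in range(num_tasks):
        t, name = heapq.heappop(free_at)  # earliest-free node wins
        received[name] += 1
        heapq.heappush(free_at, (t + service_times[name], name))
    return received


# Three healthy nodes need 10 time units per assignment. The node named
# "faulty" fails at once but still reports completion after 1 time unit.
counts = simulate(100, {"node0": 10, "node1": 10, "node2": 10, "faulty": 1})
print(counts)
```

Under these assumptions, the node that reports completion fastest receives most of the assignments even though it performs no useful work, and nothing in its feedback tells the controller that it is operating in error.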