1. Field of the Invention
Embodiments of the invention generally relate to improving system utilization on a massively parallel computer system. More specifically, embodiments of the invention are related to recovering from a resource leak on a compute node (or nodes) of a multi-node computer system.
2. Description of the Related Art
Powerful computers may be designed as highly parallel systems where the processing activity of hundreds, if not thousands, of processors (CPUs) is coordinated to perform computing tasks. These systems are highly useful for a broad variety of applications, including financial modeling, hydrodynamics, quantum chemistry, astronomy, weather modeling and prediction, geological modeling, prime number factoring, and image processing (e.g., CGI animations and rendering), to name but a few examples.
For example, one family of parallel computing systems has been (and continues to be) developed by International Business Machines (IBM) under the name Blue Gene®. The Blue Gene/L architecture provides a scalable, parallel computer that may be configured with a maximum of 65,536 (2^16) compute nodes. Each compute node includes a single application-specific integrated circuit (ASIC) with two CPUs and memory. The Blue Gene/L architecture has been successful, and on Oct. 27, 2005, IBM announced that a Blue Gene/L system had reached an operational speed of 280.6 teraflops (280.6 trillion floating-point operations per second), making it the fastest computer in the world at that time. Further, as of June 2005, Blue Gene/L installations at various sites worldwide accounted for five of the ten most powerful computers in the world.
Each compute node in a massively parallel computing system may be configured to run multiple computing jobs. The jobs can be part of a single computing task or independent from one another. In some cases, a job may leave behind unwanted remnants, such as orphaned processes or temporary files stored in memory. The presence of such artifacts on a given node reduces the resources available to future computing jobs scheduled to execute on that node. Although the impact on a single node may be small, when a computing job executed on thousands of nodes creates a resource leak, the performance of the entire computing system may be substantially reduced.