Field of the Invention
Embodiments of the invention relate to multi-node computing systems.
Description of the Related Art
Powerful computers may be designed as highly parallel systems where the processing activity of thousands of processors (CPUs) is coordinated to perform computing tasks. These systems are highly useful for a broad variety of applications, including financial modeling, hydrodynamics, quantum chemistry, astronomy, weather modeling and prediction, geological modeling, prime number factoring, and image processing (e.g., CGI animations and rendering), to name but a few examples.
For example, one family of parallel computing systems has been (and continues to be) developed by International Business Machines (IBM) under the name Blue Gene®. The Blue Gene/L architecture provides a scalable, parallel computer that may be configured with a maximum of 65,536 (2^16) compute nodes. Each compute node includes a single application-specific integrated circuit (ASIC) with two CPUs and memory. The Blue Gene/L architecture has been successful: on Oct. 27, 2005, IBM announced that a Blue Gene/L system had reached an operational speed of 280.6 teraflops (280.6 trillion floating-point operations per second), making it the fastest computer in the world at that time. Further, as of June 2005, Blue Gene/L installations at various sites worldwide accounted for five of the ten most powerful computers in the world.
One requirement for a cluster computer system, such as a Blue Gene/L system, is the ability to monitor the consistency and operational capability of the system and its applications. Thus, it is useful to log the functioning of the compute nodes of the cluster computer system, to trace the applications executing on the compute nodes, and to report any detected errors and/or failures. For example, developers may use tracing information, such as tracing messages, to debug applications running on the compute nodes, while administrators may use logging information, such as logging events, for diagnostics and auditing of the system. However, for such information to be useful, numerous messages/events related to the functioning of the compute nodes and the execution of the applications have to be generated, transmitted, and stored. As the number of compute nodes in a cluster computer system grows, generating, transmitting, collecting, and/or storing tracing messages/logging events requires more and more resources and thus becomes expensive to support. Moreover, as the amount of tracing/logging performed by the computer system increases, the probability of such tracing/logging adversely impacting the performance and stability of the computer system increases as well. Additionally, as more and more data is gathered, analyzing such data becomes problematic because of the sheer volume of data.