1. Field of the Invention
The present invention generally relates to parallel computing. More specifically, the present invention relates to an interactive tool for visualizing performance data in real time to enable adaptive performance optimization and feedback.
2. Description of the Related Art
Powerful computers may be designed as highly parallel systems where the processing activity of hundreds, if not thousands, of processors (CPUs) is coordinated to perform computing tasks. These systems are highly useful for a broad variety of applications, including financial modeling, hydrodynamics, quantum chemistry, astronomy, weather modeling and prediction, geological modeling, prime number factoring, and image processing (e.g., CGI animations and rendering), to name but a few examples.
For example, one family of parallel computing systems has been (and continues to be) developed by International Business Machines (IBM) under the name Blue Gene®. The Blue Gene/L architecture provides a scalable, parallel computer that may be configured with a maximum of 65,536 (2^16) compute nodes. Each compute node includes a single application specific integrated circuit (ASIC) with two CPUs and memory. The Blue Gene/L architecture has been successful, and on Oct. 27, 2005, IBM announced that a Blue Gene/L system had reached an operational speed of 280.6 teraflops (280.6 trillion floating-point operations per second), making it the fastest computer in the world at that time. Further, as of June 2005, Blue Gene/L installations at various sites world-wide accounted for five of the ten most powerful computers in the world.
In addition to the Blue Gene architecture developed by IBM, other highly parallel computer systems have been (and are being) developed. For example, a Beowulf cluster may be built from a collection of commodity off-the-shelf personal computers. In a Beowulf cluster, individual systems are connected using local area network technology (e.g., Gigabit Ethernet) and system software is used to execute programs written for parallel processing on the cluster of individual systems.
Compute nodes in a parallel system communicate with one another over one or more communication networks. For example, the compute nodes of a Blue Gene/L system are interconnected using five specialized networks, and the primary communication strategy for the Blue Gene/L system is message passing over a torus network (i.e., a set of point-to-point links between pairs of nodes). This message passing allows programs written for parallel processing to use high-level interfaces such as Message Passing Interface (MPI) and Aggregate Remote Memory Copy Interface (ARMCI) to perform computing tasks and to distribute data among a set of compute nodes. Other parallel architectures (e.g., a Beowulf cluster) also use MPI and ARMCI for data communication between compute nodes. Low-level network interfaces communicate higher-level messages using small messages known as packets. Typically, an MPI message is encapsulated in a set of packets which are transmitted from a source node to a destination node over a communications network (e.g., the torus network of a Blue Gene system).
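By way of illustration only, the two ideas above (point-to-point links in a torus and encapsulation of a message in packets) may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any actual Blue Gene or MPI implementation: the function names torus_neighbors and packetize, the three-dimensional 4 x 4 x 4 torus, and the 256-byte payload size are all hypothetical values chosen for the example.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the nodes directly linked to `node` in a 3D torus.

    Assumption: each node has one link in each direction along each of
    three axes, and coordinates wrap around at the edges of the torus.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # Modulo arithmetic provides the wraparound link.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def packetize(message: bytes, payload_size: int) -> List[bytes]:
    """Split a higher-level message into fixed-size packet payloads.

    Assumption: packets carry only payload here; real packets would also
    carry headers (e.g., destination and sequence information).
    """
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    return [
        message[offset:offset + payload_size]
        for offset in range(0, len(message), payload_size)
    ]


if __name__ == "__main__":
    dims = (4, 4, 4)            # hypothetical torus dimensions
    print(torus_neighbors((0, 0, 0), dims))
    packets = packetize(b"x" * 600, 256)   # hypothetical payload size
    print([len(p) for p in packets])
```

In this sketch a 600-byte message is carried by three packets, the last of which is only partially filled, and the destination node would reassemble the message by concatenating the payloads in order.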
Frequently, network contention is a major problem for the scalability of an application on a large parallel system. That is, compute nodes may compete with one another for access to the communication networks interconnecting the nodes on which the application is executing, and the more compute nodes that are dedicated to a given application, the more inter-node communication is typically required. Thus, it is desirable to optimize the configuration of a given software application, including optimizing network communication patterns of the application. Further, communication patterns tend to differ across computational phases of program execution and are often quite complex.
Furthermore, supercomputing resources are a scarce commodity, and access to a parallel computing system is usually rented and/or allocated in small discrete blocks of time. When optimizing such an application, therefore, it is important to gather as much information on as many configurations of a parallel system and/or an application as possible within an allotted time window.
Accordingly, there remains a need for an interactive tool for visualizing performance data in real time to enable adaptive performance optimization and feedback on a large parallel computing system.