Computer technology has continued to advance at a remarkable pace, with each subsequent generation of a computer system increasing in performance, functionality and storage capacity, and often at a reduced cost. A modern computer system typically comprises one or more central processing units (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communication buses and memory. A modern computer system also typically includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Sophisticated software at multiple levels directs a computer to perform massive numbers of these simple operations, enabling the computer to perform complex tasks. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster, and thereby enabling the use of software having enhanced function. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the throughput) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor(s). E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Enormous improvements in clock speed have been made possible by reduction in component size and integrated circuitry, to the point where an entire processor, and in some cases multiple processors along with auxiliary structures such as cache memories, can be implemented on a single integrated circuit chip. Despite these improvements in speed, the demand for ever faster computer systems has continued, a demand which can not be met solely by further reduction in component size and consequent increases in clock speed. Attention has therefore been directed to other approaches for further improvements in throughput of the computer system.
Without changing the clock speed, it is possible to improve system throughput by using a parallel computer system incorporating multiple processors that operate in parallel with one another. The modest cost of individual processors packaged on integrated circuit chips has made this approach practical. Although the use of multiple processors creates additional complexity by introducing numerous architectural issues involving data coherency, conflicts for scarce resources, and so forth, it does provide the extra processing power needed to increase system throughput, given that individual processors can perform different tasks concurrently with one another.
Various types of multi-processor systems exist, but one such type of system is a massively parallel nodal system for computationally intensive applications. Such a system typically contains a large number of processing nodes, each node having its own processor or processors and local (nodal) memory, where the nodes are arranged in a regular matrix or lattice structure. The system contains a mechanism for communicating data among different nodes, a control mechanism for controlling the operation of the nodes, and an I/O mechanism for loading data into the nodes from one or more I/O devices and receiving output from the nodes to the I/O device(s). In general, each node acts as an independent computer system in that the addressable memory used by the processor is contained entirely within the processor's local node, and the processor has no capability to directly reference data addresses in other nodes. However, the control mechanism and I/O mechanism are shared by all the nodes.
A massively parallel nodal system such as described above is a general-purpose computer system in the sense that it is capable of executing general-purpose applications, but it is designed for optimum efficiency when executing computationally intensive applications, i.e., applications in which the proportion of computational processing relative to I/O processing is high. In such an application environment, each processing node can independently perform its own computationally intensive processing with minimal interference from the other nodes. In order to support computationally intensive processing applications which are processed by multiple nodes in cooperation, some form of inter-nodal data communication matrix is provided. This data communication matrix supports selective data communication paths in a manner likely to be useful for processing large processing applications in parallel, without providing a direct connection between any two arbitrary nodes. Optimally, I/O workload is relatively small, because the limited I/O resources would otherwise become a bottleneck to performance.
An exemplary massively parallel nodal system is the IBM Blue Gene®/L (BG/L) system. The BG/L system contains many (e.g., in the thousands) processing nodes, each having multiple processors and a common local (nodal) memory, and with five specialized networks interconnecting the nodes for different purposes. The processing nodes are arranged in a logical three-dimensional torus network having point-to-point data communication links between each node and its immediate neighbors in the network. Additionally, each node can be configured to operate either as a single node or multiple virtual nodes (one for each processor within the node), thus providing a fourth dimension of the logical network. A large processing application typically creates one ore more blocks of nodes, herein referred to as communicator sets, for performing specific sub-tasks during execution. The application may have an arbitrary number of such communicator sets, which may be created or dissolved at multiple points during application execution. The nodes of a communicator set typically comprise a rectangular parallelopiped of the three-dimensional torus network.
The hardware architecture supported by the BG/L system and other massively parallel computer systems provides a tremendous amount of potential computing power, e.g., petaflop or higher performance. Furthermore, the architectures of such systems are typically scalable for future increases in performance. However, unless the software applications running on the hardware architecture operate efficiently, the overall performance of such systems can suffer.
As an example, the BG/L system supports a message-passing programming library referred to as the Message Passing Interface (MPI). The selection of an appropriate network among the multiple networks supported in the BG/L system, as well as the appropriate communications algorithm, for each messaging primitive can have a significant effect on system and application performance. Consequently, in performance sensitive communications libraries like MPI, it is often the case that multiple internal implementations of the same external Application Programming Interface (API) function will exist, with each implementation differing from other implementations based upon the network and/or algorithm used. Typically, each implementation will have characteristics that make it them the fastest under some conditions, but not under other conditions, often based on the parameters passed to the function. For example, a simple point-to-point “send” may have two implementations: one for short messages designed to have low latency, and another for long messages designed to maximize bandwidth. This is frequently required because the positive aspects of low latency and high bandwidth tend to oppose each other.
In many instances, application developers will manually select specific implementations to call based upon the expected scenario under which those implementations will be used. In addition, in some instances a developer may hard code rules that select which implementation should be used in different circumstances. As a result, choices may be hard coded to some specific cut-off values, or the rules may be adjusted manually over time to attempt to find the fastest version. Neither of these solutions, however, will be able to reliably use the fastest implementation of a function on the first attempt to use the function, particularly when the precise environment within which a particular function may be called is not known beforehand.
It has been found, however, that the initial execution of an optimal implementation of an MPI function by an application can have a significant impact on the overall performance of a truly high performance parallel computing system. Furthermore, conventional solutions, which are essentially based upon hard coded rules established by a developer, have been found to lack the flexibility necessary to provide optimal selection of function implementations in different environments. In addition, it has been found that this same issue exists in connection with a wide variety of functions that may be implemented in a multitude of different ways. Therefore, a need exists for an improved manner of selecting from among multiple available implementations of a function to optimize the performance of applications running in a parallel computer system.