In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises one or more central processing units (CPUs) and supporting hardware necessary to store, retrieve and transfer information, such as communication buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU or CPUs are the heart of the system. They execute the instructions which comprise a computer program and direct the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Sophisticated software at multiple levels directs a computer to perform massive numbers of these simple operations, enabling the computer to perform complex tasks. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster, and thereby enabling the use of software having enhanced function. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the throughput) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor(s). For example, if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Enormous improvements in clock speed have been made possible by reduction in component size and integrated circuitry, to the point where an entire processor, and in some cases multiple processors along with auxiliary structures such as cache memories, can be implemented on a single integrated circuit chip. Despite these improvements in speed, the demand for ever faster computer systems has continued, a demand which cannot be met solely by further reduction in component size and consequent increases in clock speed. Attention has therefore been directed to other approaches for further improvements in throughput of the computer system.
Without changing the clock speed, it is possible to improve system throughput by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made this approach practical. Although the use of multiple processors creates additional complexity by introducing numerous architectural issues involving data coherency, conflicts for scarce resources, and so forth, it does provide the extra processing power needed to increase system throughput.
Various types of multi-processor systems exist, but one such type of system is a massively parallel nodal system for computationally intensive applications. Such a system typically contains a large number of processing nodes, each node having its own processor or processors and local (nodal) memory, where the nodes are arranged in a regular matrix or lattice structure. The system contains a mechanism for communicating data among different nodes, a control mechanism for controlling the operation of the nodes, and an I/O mechanism for loading data into the nodes from one or more I/O devices and receiving output from the nodes to the I/O device(s). In general, each node acts as an independent computer system in that the addressable memory used by the processor is contained entirely within the processor's local node, and the processor has no capability to directly reference data addresses in other nodes. However, the control mechanism and I/O mechanism are shared by all the nodes.
A massively parallel nodal system such as described above is a general-purpose computer system in the sense that it is capable of executing general-purpose applications, but it is designed for optimum efficiency when executing computationally intensive applications, i.e., applications in which the proportion of computational processing relative to I/O processing is high. In such an application environment, each processing node can independently perform its own computationally intensive processing with minimal interference from the other nodes. An inter-nodal data communication matrix supports cooperation among nodes in processing large applications in parallel. Optimally, I/O workload is relatively small in comparison to the collective processing capabilities of the nodes' processors, because the limited I/O resources would otherwise become a bottleneck to performance.
In a massively parallel nodal system, a single node may contain a single processor (sometimes called a processor core), or may contain multiple processors. In some massively parallel systems, multiple processors within a node can act as independent processing entities, each executing a respective user application process and maintaining process state independently.
An exemplary massively parallel nodal system is the IBM Blue Gene™ system. The IBM Blue Gene system contains many processing nodes, each having multiple processors and a common local (nodal) memory. The processing nodes are arranged in a logical three-dimensional torus network having point-to-point data communication links between each node and its immediate neighbors in the network. Additionally, each node can be configured to operate either as a single node (coprocessor mode) or as multiple virtual nodes (virtual node mode), thus providing a fourth dimension of the logical network.
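The neighbor relationship in a logical three-dimensional torus can be illustrated with a minimal Python sketch. The function name `torus_neighbors`, the coordinate tuples, and the 4 x 4 x 4 lattice size are hypothetical and chosen only for illustration; they are not taken from the Blue Gene design. The sketch assumes each node is identified by an (x, y, z) coordinate and that links wrap around at the edges of the lattice.

```python
def torus_neighbors(coord, dims):
    """Return the six immediate neighbors of `coord` in a 3D torus.

    coord: (x, y, z) position of a node (hypothetical addressing scheme).
    dims:  (X, Y, Z) number of nodes along each axis.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo provides the wraparound that distinguishes a torus
            # from a plain mesh: stepping off one face re-enters the other.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


# A corner node of an illustrative 4 x 4 x 4 lattice still has six neighbors.
corner = torus_neighbors((0, 0, 0), (4, 4, 4))
```

Because of the wraparound, every node in such a lattice has the same number of point-to-point links regardless of its position. For an axis of length 1 or 2 this sketch would list the same node more than once; it does not attempt to de-duplicate.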
In coprocessor mode, one of the processors acts as a primary processor directing the execution of a user application process, while the other processor or processors act as co-processors for performing tasks assigned by the primary processor, such as I/O operations. In coprocessor mode, the entire nodal memory is dedicated to the threads being executed by the primary processor and is directly addressable by the primary processor. In multi-processor or “virtual node” mode, each processor acts independently of the other processor or processors, executing a respective user application process and maintaining its process state independently. The processes executing in the different processors in virtual node mode may be, and usually are, parts of a common user application, although they need not be.
The architecture of certain massively parallel nodal systems such as IBM Blue Gene systems is designed around the idea that each node has its own independent state and independent memory. When a node is configured to run in multiprocessor mode, each processor portion of the node should act, for most purposes, as if it were an independent node. In particular, each processor portion of the node should have its own independent memory, directly addressable by it alone and not by other processors, including the other processor or processors in the same node. Since the node contains a single common physical memory, it is desirable that this memory be subdivided among the processors on a fixed basis, so that each processor has its own portion.
Subdividing the local nodal memory is a relatively static operation. A process executing in a local memory portion generally needs a guarantee that memory, once assigned to it, will remain assigned for the duration of the process; otherwise data may be lost. Existing Blue Gene systems partition the local memory into fixed, discrete, equal partitions, one for each processor, when configured to run in multiprocessor mode. Unfortunately, some processes require, or execute optimally with, more memory than the fixed portion, while others require less. It is generally difficult or impossible to predict the memory requirements of processes in advance.
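The limitation of fixed, equal partitioning can be shown with a short Python sketch. The function names, the 512 MiB nodal memory size, the two-processor node, and the per-process demands are all illustrative assumptions, not figures taken from the systems described above.

```python
MIB = 1024 * 1024


def fixed_partitions(total_bytes, n_processors):
    """Split nodal memory into equal, contiguous (start, size) partitions."""
    share = total_bytes // n_processors
    return [(i * share, share) for i in range(n_processors)]


def fits(partitions, demands):
    """For each processor, report whether its process fits in its partition."""
    return [demand <= size for (_, size), demand in zip(partitions, demands)]


# Hypothetical node: 512 MiB shared by two processors, 256 MiB each.
partitions = fixed_partitions(512 * MIB, 2)

# Hypothetical demands: one process wants 300 MiB, the other only 100 MiB.
demands = [300 * MIB, 100 * MIB]
result = fits(partitions, demands)
```

In this example the combined demand (400 MiB) is less than the nodal memory (512 MiB), yet the first process does not fit in its fixed 256 MiB share while the second leaves most of its share unused.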
It would be desirable to provide some form of dynamic or variable subdividing of the nodal memory in a massively parallel nodal system having multiple processors in each node, while at the same time preventing memory starvation of processes and maintaining architectural constraints of isolating the processes of different processors. It would further be desirable to provide a software-based mechanism for subdividing nodal memory, which does not require special hardware support.