1. Technical Field
This invention relates generally to multi-processor computer systems, and more particularly to such systems in which there are a number of building blocks divided into a number of partitions.
2. Description of the Prior Art
There are many different types of multi-processor computer systems. A symmetric multi-processor (SMP) system includes a number of processors that share a common memory. SMP systems provide scalability. As needs dictate, additional processors can be added. SMP systems usually range from two to 32 or more processors. One processor generally boots the system and loads the SMP operating system, which brings the other processors online. Without partitioning, there is only one instance of the operating system and one instance of the application in memory. The operating system uses the processors as a pool of processing resources, all executing simultaneously, where each processor either processes data or is in an idle loop waiting to perform a task. SMP systems increase in speed whenever processes can be overlapped.
A massively parallel processor (MPP) system can use thousands of processors or more. MPP systems use a different programming paradigm than the more common SMP systems. In an MPP system, each processor contains its own memory and its own copy of the operating system and application. Each subsystem communicates with the others through a high-speed interconnect. To use an MPP system effectively, an information-processing problem should be breakable into pieces that can be solved simultaneously. For example, in scientific environments, certain simulations and mathematical problems can be split apart and each part processed at the same time.
A non-uniform memory access (NUMA) system is a multi-processing system in which memory is separated into distinct groups. NUMA systems are similar to SMP systems. In SMP systems, however, all processors access a common memory at the same speed. By comparison, in a NUMA system, memory on the same processor board, or in the same building block, as the processor is accessed faster than memory on other processor boards, or in other building blocks. That is, local memory is accessed faster than distant shared memory. NUMA systems generally scale better to higher numbers of processors than SMP systems.
A particular type of NUMA system is the cache coherent NUMA (CC-NUMA) system. In a CC-NUMA system, the system hardware handles cache coherency between the system building blocks, as well as within them. That is, hardware cache coherency means that there is no software requirement for keeping multiple copies of data up to date, or for transferring data between multiple instances of the operating system or an application. Thus, distributed memory is tied together to form a single memory, and there is no copying of pages or data between memory locations. There is also no software message passing, but rather a single memory map having pieces physically tied together with sophisticated hardware.
The term building block is used herein in a general manner, and encompasses a separable grouping of processor(s), other hardware, such as memory, and software that can communicate with other building blocks. Building blocks can themselves be grouped together into partitions. A single partition runs a single instance of an operating system. A partition can include one or more building blocks. A system, or a platform, is the whole of all the partitions of all the building blocks. Thus, the building blocks of a platform may be partitioned into a number of partitions of the platform, and so on. Furthermore, two or more partitions can be grouped together as a cluster, where each partition runs its own operating system instance, but has access to shared storage with the other partitions. A cluster is therefore different than a partition, and a partition is different than a building block. The term node is not used herein, as it can sometimes refer to a partition, and other times refer to a building block.
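By way of illustration only, the relationships among building blocks, partitions, clusters, and the platform described above can be sketched as a simple data model. The class names, field names, and example values below are hypothetical and are not taken from any embodiment; the sketch only mirrors the containment relationships stated in the preceding paragraph.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BuildingBlock:
    # A separable grouping of processor(s), memory, and other hardware.
    name: str
    processors: int


@dataclass
class Partition:
    # A partition runs a single operating system instance and
    # includes one or more building blocks.
    os_instance: str
    blocks: List[BuildingBlock] = field(default_factory=list)


@dataclass
class Cluster:
    # Two or more partitions, each with its own operating system
    # instance, that have access to shared storage.
    partitions: List[Partition]
    shared_storage: str


@dataclass
class Platform:
    # The whole of all the partitions of all the building blocks.
    partitions: List[Partition]

    def all_blocks(self) -> List[BuildingBlock]:
        return [b for p in self.partitions for b in p.blocks]


# Hypothetical example: four building blocks split into two partitions.
p0 = Partition("os-0", [BuildingBlock("bb0", 4), BuildingBlock("bb1", 4)])
p1 = Partition("os-1", [BuildingBlock("bb2", 4), BuildingBlock("bb3", 4)])
platform = Platform([p0, p1])
cluster = Cluster([p0, p1], shared_storage="shared-disk-array")
```

In this sketch a cluster refers to the same partition objects as the platform, which reflects that a cluster is a grouping of partitions rather than a separate kind of building block.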
Another particular type of NUMA system is the NUMA-quad (NUMA-Q) system. A NUMA-Q system is a NUMA system in which the fundamental building block is the quad, or the quad building block (QBB). Each quad can contain up to four processors, a set of memory arrays, and an input/output (I/O) processor (IOP) that, through two host bus adapters (HBAs), accommodates two to eight I/O buses. An internal switch in each QBB allows all processors equal access to both local memory and the I/O buses connected to the local I/O processor. An application running on a processor in one QBB can thus access the local memory of its own QBB, as well as the shared memory of the other QBBs. More generally, a quad refers to a building block having at least a collection of up to four processors and an amount of memory.
A difficulty with nearly any type of multi-processor computer system is the manner by which building blocks are bound together into partitions at startup. One approach involves selecting a master building block, which oversees the booting up of the other building blocks, as well as the partitioning of the building blocks into the desired partitions. However, this approach is not particularly fault-tolerant: should the master building block fail, the entire platform can potentially also fail, since no master then remains to oversee the partitioning process. Redundant master building blocks and other ways to add fault tolerance to the system have been suggested, but can be overly complex and difficult to implement.
Another approach to binding building blocks into desired partitions at startup can be referred to as the masterless approach, in that no single building block is a priori designated as the master to oversee the binding process. Traditionally, however, the masterless approach has been plagued by race conditions and other difficulties. For example, two building blocks may decide to become the temporary master at the same time. Having a preordained ordering of which building blocks are to temporarily retain master status is also problematic, because two otherwise identical building blocks may complete their startup processes in different lengths of time and/or at different times. The orderly binding of building blocks into partitions is thus difficult to guarantee.
For these and other reasons, there is a need for the present invention.
The invention relates to a masterless approach for binding building blocks into partitions. The adjectives first and second are used herein for distinguishing among different instances of the noun to which they relate. For example, the terms first physical port identifier and second physical port identifier use the adjectives first and second to distinguish between the former physical port identifier and the latter physical port identifier. The adjectives first and second have no inherent or implied meaning other than their use for distinguishing purposes.
A method of the invention for binding a building block of a platform to a partition in a masterless manner first sends to other building blocks of the platform a first physical port identifier indicating the physical location of the building block in the platform. A first partition identifier indicating the partition of the building block is also sent to the other building blocks. Second physical port identifiers and second partition identifiers are received from the other building blocks. The first physical port identifier and the second physical port identifiers of a subset of the other building blocks are then sent to the subset, where the second partition identifiers of the subset are equal to the first partition identifier. The first physical port identifier and the second physical port identifiers of the subset are also received from every other building block of the subset. A first logical port identifier indicating the logical location of the building block in the partition identified by the first partition identifier is sent to the subset of the other building blocks, and second logical port identifiers are received from the subset. The partition indicated by the first partition identifier is then joined by the building block.
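By way of illustration only, the exchanges recited above can be sketched as a single-process simulation in which every building block performs the same steps and none is designated as a master. The names `BuildingBlock` and `bind_masterless` are hypothetical. The rule used here to derive a logical port identifier (the rank of the physical port identifier among the members of the same partition) is an assumption made for the sketch; the method as summarized above does not state how that identifier is chosen. Real message passing, timeouts, and failure handling are omitted.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass
class BuildingBlock:
    physical_port: int                  # physical location in the platform
    partition_id: int                   # partition the block is to join
    logical_port: Optional[int] = None  # logical location, set during binding
    joined: bool = False


def bind_masterless(blocks: List[BuildingBlock]) -> None:
    # Exchange 1: every block sends its physical port identifier and
    # partition identifier to all other blocks and receives theirs.
    announced: Dict[int, int] = {b.physical_port: b.partition_id for b in blocks}

    # Exchange 2: every block forms the subset of blocks whose partition
    # identifier equals its own, and sends that member list to the subset.
    member_lists: Dict[int, FrozenSet[int]] = {
        b.physical_port: frozenset(
            port for port, part in announced.items() if part == b.partition_id
        )
        for b in blocks
    }
    # Each block checks that the list received from every other member
    # of its subset matches its own list.
    for b in blocks:
        mine = member_lists[b.physical_port]
        if any(member_lists[peer] != mine for peer in mine):
            raise RuntimeError("membership lists disagree")

    # Exchange 3 (assumed rule): the logical port identifier is the rank
    # of the physical port identifier within the agreed member list.
    for b in blocks:
        ordered = sorted(member_lists[b.physical_port])
        b.logical_port = ordered.index(b.physical_port)

    # Each block receives the logical port identifiers of its subset,
    # checks that they are distinct, and then joins the partition.
    by_port = {b.physical_port: b for b in blocks}
    for b in blocks:
        logical = [by_port[p].logical_port for p in member_lists[b.physical_port]]
        if len(set(logical)) != len(logical):
            raise RuntimeError("duplicate logical port identifier")
        b.joined = True
```

Because every block derives the same member list from the same announcements, no block needs to wait for another to take charge; in the simulation the consistency checks always pass, whereas in a real platform they would guard against blocks that started at different times or saw different announcements.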
A system of the invention includes a platform, a number of building blocks of the platform, and a number of partitions of the platform. Each building block has a physical port identifier that indicates its physical location in the platform, a partition identifier, and a logical port identifier indicating its logical location in the partition identified by the partition identifier. The partition identifier of each building block indicates one of the number of partitions to which the building block is bound in a masterless manner. The masterless manner uses the physical port identifiers, the logical port identifiers, and the partition identifiers of the number of building blocks to bind the blocks to partitions.
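By way of illustration only, the three identifiers that each building block of the system carries can be sketched as a record, together with a helper that derives the layout of each partition from those records. The names `BlockIds` and `partition_table` are hypothetical, and the uniqueness checks are assumptions made for the sketch rather than requirements stated above.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class BlockIds(NamedTuple):
    physical_port: int  # physical location in the platform
    partition_id: int   # partition to which the block is bound
    logical_port: int   # logical location within that partition


def partition_table(blocks: List[BlockIds]) -> Dict[int, Dict[int, int]]:
    """Map each partition id to {logical port: physical port}."""
    table: Dict[int, Dict[int, int]] = defaultdict(dict)
    seen_physical = set()
    for b in blocks:
        # Assumed invariant: a physical port identifies exactly one block.
        if b.physical_port in seen_physical:
            raise ValueError("duplicate physical port identifier")
        seen_physical.add(b.physical_port)
        # Assumed invariant: logical ports are unique within a partition.
        if b.logical_port in table[b.partition_id]:
            raise ValueError("duplicate logical port identifier in partition")
        table[b.partition_id][b.logical_port] = b.physical_port
    return dict(table)
```

Any building block holding the identifiers of its peers could compute the same table locally, which is consistent with the absence of a master.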
An article of manufacture of the invention includes a computer-readable medium and means in the medium. The means in the medium is for joining a partition indicated by a first partition identifier of a building block of a platform in a masterless manner. The masterless manner uses the first partition identifier, a first physical port identifier, and a first logical port identifier, as well as second physical port identifiers and second logical port identifiers of other building blocks of the platform, to join the partition. The first physical port identifier indicates the physical location of the building block in the platform, and the first logical port identifier indicates the logical location of the building block in the partition identified by the first partition identifier.
Other features and advantages of the invention will become apparent from the following detailed description of the presently preferred embodiment of the invention, taken in conjunction with the accompanying drawings.