Parallel computer systems have proven to be an expedient solution for achieving greatly increased processing speeds heretofore beyond the capabilities of conventional computational architectures. With the advent of massively parallel processing machines such as the IBM RS/6000 SP1 and the IBM RS/6000 SP2, volumes of data may be efficiently managed and complex computations may be rapidly performed. (IBM and RS/6000 are registered trademarks of International Business Machines Corporation, Old Orchard Road, Armonk, N.Y., the assignee of the present application).
A typical massively parallel processing system may include a relatively large number, often in the hundreds or even thousands, of separate, though relatively simple, microprocessor-based nodes which are interconnected via a communications fabric comprising a high speed packet switch network. Messages, in the form of packets, are routed over the network between the nodes, enabling communication therebetween. The nodes typically comprise a microprocessor and associated support circuitry such as random access memory (RAM), read only memory (ROM), and input/output (I/O) circuitry which may further include a communications subsystem having an interface for enabling the node to communicate through the network.
Among the wide variety of packet network forms currently available, perhaps the most traditional architecture implements a multi-stage interconnected arrangement of relatively small cross point switches, with each switch typically being an N-port bi-directional router where N is usually either 4 or 8 and with each of the N ports internally interconnected via a cross point matrix. For our purposes herein, we will consider the switch to be an 8 port router switch. In such a network, each switch in one stage, beginning at one side (the so-called input side) of the network, is interconnected through a unique path (typically a byte-wide physical connection) to a switch in the next succeeding stage, and so forth until the last stage is reached at an opposite side (the so-called output side) of the network. The bi-directional router switch included in this network is generally available as a single integrated circuit (i.e., a "switch chip") which is operationally non-blocking, and is accordingly a popular design choice. Such a switch chip is described in the U.S. patent application having Ser. No. 08/424,824 entitled "A Central Shared Queue Based Time Multiplexed Packet Switch With Deadlock Avoidance" by P. Hochschild et al. filed Mar. 4, 1996, now U.S. Pat. No. 5,546,391.
A switching network typically comprises a number of these switch chips organized into two interconnected stages, for example a four switch chip input stage followed by a four switch chip output stage, with all eight switch chips included on a single switch board. With such an arrangement, a message passing between two ports on different switch chips in the input stage would first be routed through the switch chip in the input stage that contains the source (input) port to any of the four switch chips comprising the output stage. From that output stage switch chip, the message would be routed back (i.e., the message packet would reverse its direction) to the switch chip in the input stage that includes the destination (output) port for the message. Alternatively, in larger systems comprising a plurality of such switch boards, messages may be routed from a processing node, through a switch chip in the input stage of the switch board to a switch chip in the output stage of the switch board and from the output stage switch chip to another interconnected switch board (and thereon to a switch chip in the input stage). Within an exemplary switch board, switch chips that are directly linked to nodes are termed node switch chips (NSCs) and those which are connected directly to other switch boards are termed link switch chips (LSCs). Inter-switch chip routing is typically pre-defined during system initialization and rarely altered thereafter.
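The single-board routing just described can be illustrated with a minimal sketch. It assumes that every input stage switch chip is linked to every output stage switch chip, which the phrase "any of the four" suggests but does not spell out. The chip labels and the function name `board_routes` are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple

# Hypothetical labels for the eight switch chips on one switch board.
INPUT_STAGE = ["NSC0", "NSC1", "NSC2", "NSC3"]   # chips linked to nodes
OUTPUT_STAGE = ["LSC0", "LSC1", "LSC2", "LSC3"]  # chips linked to other boards


def board_routes(src_chip: str, dst_chip: str) -> List[Tuple[str, ...]]:
    """List the candidate chip-level routes between two input stage chips.

    Assumption: each input stage chip is connected to each output stage chip.
    """
    if src_chip not in INPUT_STAGE or dst_chip not in INPUT_STAGE:
        raise ValueError("source and destination must be input stage chips")
    if src_chip == dst_chip:
        # Both ports sit on the same chip, so no output stage hop is needed.
        return [(src_chip,)]
    # Otherwise the packet goes up to any output stage chip and turns back.
    return [(src_chip, turnaround, dst_chip) for turnaround in OUTPUT_STAGE]


if __name__ == "__main__":
    for route in board_routes("NSC0", "NSC2"):
        print(" -> ".join(route))
```

Under these assumptions a message between ports on two different input stage chips has four candidate routes, one per output stage chip, and each route passes through exactly three switch chips.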
Switch boards of the type described above may simply interconnect a plurality of nodes or, alternatively, in larger systems, a plurality of interconnected switch boards may have their input stages connected to nodes and their output stages connected to other switch boards; such boards are termed node switch boards (NSBs). Even more complex switching networks may comprise intermediate stage switch boards which are interposed between and interconnect a plurality of NSBs. These intermediate switch boards (ISBs) serve as a conduit for routing message packets between nodes coupled to switches in a first and a second NSB. For purposes of the ensuing discussion, the switch chips located on these ISBs will be termed intermediate switch chips (ISCs).
In massively parallel processing systems, it is a popular implementation choice to partition the processing nodes of the system so as to establish multiple smaller parallel processing systems within the massively parallel processing system. Each disjoint set of processing nodes is located exclusively within one of the smaller system partitions and cannot share communication paths with the sets of nodes residing in other system partitions.
The U.S. patent application having Ser. No. 08/664,900 entitled "System Partitioning for Massively Parallel Processors" filed Jun. 17, 1996, by Brenner et al. now U.S. Pat. No. 5,799,149, as well as the pending, cross-referenced U.S. patent applications having Ser. Nos. 08/664,577 entitled "System for Preserving Logical Partitions of Distributed Parallel Processing System After Re-Booting by Mapping Nodes to their Sub-Enviroments" filed Jun. 17, 1996, 08/664,580 entitled "An Apparatus and Method for Creating Isolated Sub-Environments Using Host Names and Aliases" filed Jun. 17, 1996, and 08/664,689 entitled "Use Of Daemons in a Partitioned Massively Parallel Processing System Environment" filed Jun. 17, 1996, all by Brenner et al. and all commonly assigned to the present assignee, are directed toward creating node-based system partitions in a massively parallel processing system, and while they are not directed toward providing a method and apparatus for efficiently allocating the switching fabric of the massively parallel processing system among the system partitions, they do provide an excellent background for the present invention, and as such, are incorporated herein by reference.
Partitioning of multinode systems provides the user with the ability to completely isolate computing environments within the parallel processing system from one another. This ability to carve out isolated smaller partitions of processors from a larger processing system has proven advantageous for a variety of system implementations. For example, a test environment for a new beta-level version of an operating system may be run on the same system, but in a system partition which is completely isolated from a production environment operating system operating on a different system partition. Moreover, in designing optimized computing environments within a single partitioned parallel processing system, the cross-over of packet traffic from a first partition to the switches of a second partition may degrade the performance of the computing environment associated with the shared switches. For example, a plurality of processing nodes in the massively parallel processing system may be used for processing a parallel database system, while the remaining nodes are used to process another, time-critical, parallel processing application. While the massively parallel processing system can accommodate the concurrent execution of both of these jobs, each job that is executed competes for a limited set of node and switch resources. In a switching fabric of a massively parallel processing system utilizing a high performance switch, it is possible for one job to monopolize the switch resource and thereby degrade the performance of the other job. Accordingly, to ensure optimal performance for concurrently operating computing environments within a single parallel processing system, disjoint partitioning of the switching resource among the disjoint system partitions must be implemented in a manner which ensures that each system partition makes the most efficient use of its allocated switching resource.
Massively parallel processing machines have previously been implemented so as to provide the user with a pre-defined static set of partition configurations incorporating many constraints. For example, in the case of the RS/6000 SP2, prior to the present invention, a maximum of only three partitions was permitted and the smallest partition would typically be set at all processing nodes connecting to a single NSB. The switch resource partitioning and allocation techniques presented herein advantageously free a system administrator to implement customized partitions within a parallel processing system which may not be included within the previously provided static configuration set, and also provide system optimization capabilities.
From the foregoing it is clear that in order to accommodate a flexible partitioning of parallel processing systems, the switch network must likewise be capable of being flexibly partitioned among the system partitions to provide communication links between nodes within the same partition while ensuring that communication paths between nodes in different partitions do not intersect. Since a number of physical constraints exist for allocating resources on the switch network to system partitions, implementation of this partitioned switching network creates resource allocation problems which increase in complexity as the number of nodes in the system increases.
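The requirement that communication paths of different partitions not intersect can be expressed as a simple check. The following is a minimal sketch, not the allocation method of the present invention: it assumes that each partition's routes are already known as sequences of switch chip labels, and the function name `shared_switches` and all labels are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Set


def shared_switches(
    routes_by_partition: Dict[str, Iterable[Sequence[str]]]
) -> Dict[str, List[str]]:
    """Return each switch chip used by more than one partition.

    The result maps a switch chip label to the sorted list of partitions
    whose routes traverse it. An empty result means the partitions' paths
    do not intersect at any switch chip.
    """
    users: Dict[str, Set[str]] = {}
    for partition, routes in routes_by_partition.items():
        for route in routes:
            for switch in route:
                users.setdefault(switch, set()).add(partition)
    return {sw: sorted(ps) for sw, ps in users.items() if len(ps) > 1}


if __name__ == "__main__":
    # Hypothetical routes: both partitions turn around at chip LSC0.
    conflicting = {
        "P1": [("NSC0", "LSC0", "NSC1")],
        "P2": [("NSC2", "LSC0", "NSC3")],
    }
    print(shared_switches(conflicting))
```

A check of this kind only detects an intersection after routes have been chosen. Selecting routes so that no intersection arises is the allocation problem whose complexity grows with the number of nodes.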
A number of generally applicable resource partitioning schemes have been implemented in computer systems. For example, U.S. Pat. No. 5,036,473 entitled "Method For Using Electronically Reconfigurable Logic Circuits" by Butts et al. describes a hierarchical partitioning scheme for a reconfigurable interconnection of logic chips. The system is designed to be partitioned into multiple clusters in accordance with a partitioning hierarchy which assigns design primitives to a box, board and logic chip, while satisfying system constraints. The hierarchical partitioning methodology initially places all primitives into a null cluster, and proceeds to form clusters by selecting a seed primitive from the null cluster and by moving primitives having the highest advantage function (a function that is specific to this implementation) into a cluster until it is full. This partitioning method is focused upon satisfying very specific system constraints, and proceeds by assigning the smallest logical levels of the system to build clusters which ultimately define the partitioned structure of the system. The partitioning method is a logic partitioning method rather than a solution for allocating switching resource among disjoint processing node partitions. Moreover, while the disclosed methodology for building logic partitions on a logic element-by-logic element basis is well suited for the logic design described in Butts et al., it would prove error-laden and time consuming in other partitioned systems. For example, in systems in accordance with the focus of the present invention in which sets of disjoint nodes have been previously partitioned and wherein it is desired to optimize switch partitioning to allocate disjoint sets of switches to each node partition, an element-by-element method for the creation of switch partitions would require numerous attempts before achieving a workable, albeit less than optimal, switch partition allocation.
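The element-by-element character of such a seed-based clustering approach may be easier to see in a short sketch. The following is only an illustration of the general technique, not the method of Butts et al.: the advantage function used here (the number of connections a primitive has to the current cluster), the choice of the first remaining primitive as seed, and the names `greedy_clusters`, `connections` and `capacity` are all assumptions made for the example.

```python
from typing import Dict, List, Set


def greedy_clusters(connections: Dict[str, Set[str]], capacity: int) -> List[List[str]]:
    """Form clusters one primitive at a time from an initial null cluster.

    connections maps each primitive to the primitives it is wired to.
    capacity is the maximum number of primitives per cluster.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    null_cluster = sorted(connections)  # every primitive starts unassigned
    clusters: List[List[str]] = []
    while null_cluster:
        cluster = [null_cluster.pop(0)]  # assumed seed choice: first remaining
        while null_cluster and len(cluster) < capacity:
            members = set(cluster)
            # Assumed advantage function: connections into the current cluster.
            best = max(null_cluster, key=lambda p: len(connections[p] & members))
            null_cluster.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    wiring = {"a": {"c"}, "b": {"d"}, "c": {"a"}, "d": {"b"}}
    print(greedy_clusters(wiring, capacity=2))
```

Each placement in such a scheme is a local decision made without regard to later placements, which is why an element-by-element approach can need repeated attempts when the goal is a globally balanced allocation.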
An article entitled "Programmable Interconnection Switch Structure for Large Scale Machine Prototyping", published in the IBM Technical Disclosure Bulletin (TDB) Vol. 35, No. 1A, June 1992, describes a method and system for providing a prototype environment for large scale digital system design. The article proposes the use of "soft-chips" such as field programmable gate arrays (FPGAs) to create a prototype system partitioned into "islands" of logic function used to create connections to switch chips. Signals traversing a switch chip from a logic source to a destination require one input pin and one or more output pins on the chip. Multiple routes may be stored and implemented over the shared connection resource on a time-shared basis. A switch chip in this system may participate in any number of routes and is not constrained, as in typical partitioned parallel processing systems, to exclusive use within a single partition. Accordingly, the TDB article does not offer a scheme for creating disjoint partitions as is required in a partitioned massively parallel processing system.
In a more recent TDB article entitled "Multi-Stage Interconnection Network Topologies for Large Systems" (IBM TDB Vol. 38 No. 10 October 1995), topologies for systems having 129-512 nodes are presented. The article discusses the inclusion of NSBs and ISBs of the type previously described, and a method for connecting them in 256-way and 512-way systems. However, it does not address the issue of partitioning the switching network to allocate switches among system partitions.
It is apparent from the foregoing that a mechanism for managing resource allocation by partitioning a switch network so as to accommodate disjoint partitions of processing nodes in a partitioned parallel processing system would prove useful to a system administrator attempting to manage a partitioned multinode system. Moreover, a need exists for such a mechanism in which implementation of the switch partitioning and allocation is balanced, optimal and satisfies a wide range of system partition configurations. These requirements as well as other advantageous features are addressed by the present invention.