The present invention relates to methods and apparatus for parallel multiprocessor computer systems and more specifically to a multiprocessor node-controller circuit and method.
Multiprocessor (MP) systems are computing systems comprising from a few to hundreds or thousands of processing elements (PEs). While the power of a multiple-instruction multiple-data (MIMD) MP computer system lies in its ability to execute independent threads of code simultaneously, the inherently asynchronous states of the PEs (with respect to each other) make it difficult in such a system to enforce a deterministic order of events when necessary. Program sequences involving interaction between multiple PEs, such as coordinated communication, sequential access to shared resources, controlled transitions between parallel regions, etc., may require synchronization (such as barrier and/or eureka synchronization) of the PEs in order to assure proper execution. One such invention having routers, networks, and synchronization apparatus and methods is described further in copending U.S. Pat. No. 6,085,303, issued Jul. 4, 2000, entitled “SERIALIZED, RACE-FREE VIRTUAL BARRIER NETWORK”.
Some MP systems having symmetric distributed multiprocessors use a coherent model of cache. One such system is described in application Ser. No. 08/971,184 filed Nov. 17, 1997 entitled “MULTI-DIMENSIONAL CACHE COHERENCE DIRECTORY STRUCTURE”.
There is a need in the art for an improved node controller apparatus and method to improve communications between various portions of an MP system. Further, there is a need for a node controller that will “scale well”, providing excellent performance-cost benefits for both small and large systems. Further, there is a need for a node controller that has very high flexibility, performance and speed.
The present invention provides a method and apparatus that facilitates highly parallel processing. The present invention includes a node controller usable in both small and large multiprocessor systems, and that provides superior performance-cost benefits across a large range of system prices and capabilities. In some embodiments, this node controller is implemented on a single chip that provides two or more processor ports, each supporting single-processor and/or multiprocessor subsystems (each optionally including local cache memories), as well as one or more of the following port types: input/output (I/O), memory, directory, and network interface.
Traditionally, distributed multiprocessors are built using a separate directory controller along with a memory controller, connected to the network controller, the input/output interface, and processors. Various embodiments of the present invention take the memory controller (that optionally includes a directory controller that provides cache coherence functions), the I/O controller, and the network controller, and put them all on one chip that includes a plurality of processor ports. This provides two advantages. First, transmissions between any of the nodes are direct, on chip, and are implemented using a single protocol, so that transmissions do not have to traverse as many chip boundaries. Second, because all of these functions are embedded on a single chip, a full crossbar design is utilized inside the chip. This provides non-blocking communication whereby a remote node can talk directly to the local node's memory while the local node is talking to its I/O system, with no queuing between those communications. In contrast, on a bus-type system, one of the communications would have to wait for the other to complete. These communications can proceed simultaneously in embodiments of the present invention that use a crossbar. Further, because all of these functions are built into a single chip, it is more cost effective to build a smaller system out of this same architecture, since there is not the overhead of having many extra chips to support a large system configuration when one is not building a large system.
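The contrast between a crossbar and a bus described above can be illustrated with a small model. The following is a minimal sketch and not the circuit of the invention: it assumes a simplified cycle-based model in which each source port and each destination port carries at most one transfer per cycle, and the port names and the functions `schedule_crossbar` and `schedule_bus` are hypothetical names introduced only for this illustration.

```python
def schedule_crossbar(requests):
    """Group (source, destination) transfers into cycles for a full crossbar.

    Assumed model: within one cycle, each source port and each destination
    port can take part in at most one transfer. Transfers that share neither
    a source nor a destination proceed in the same cycle.
    """
    pending = list(requests)
    cycles = []
    while pending:
        busy_sources, busy_destinations = set(), set()
        this_cycle, deferred = [], []
        for source, destination in pending:
            if source in busy_sources or destination in busy_destinations:
                deferred.append((source, destination))  # port conflict: wait
            else:
                busy_sources.add(source)
                busy_destinations.add(destination)
                this_cycle.append((source, destination))
        cycles.append(this_cycle)
        pending = deferred
    return cycles


def schedule_bus(requests):
    """Shared bus (assumed model): only one transfer per cycle, in order."""
    return [[request] for request in requests]


# A remote node reads local memory while a local processor talks to I/O.
requests = [("network", "memory"), ("processor0", "io")]
crossbar_cycles = schedule_crossbar(requests)
bus_cycles = schedule_bus(requests)
print(len(crossbar_cycles), len(bus_cycles))
```

In this model the two transfers share no port, so the crossbar schedules them in one cycle while the bus needs two. Two transfers aimed at the same destination port would still be serialized on the crossbar.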
A first aspect of the present invention provides a multiprocessor computer system (for example, a small multiprocessor system having only two node controllers connected to one another, or a multiprocessor system having up to hundreds or thousands of node controllers connected together through a router network). One such embodiment of the system includes a first node controller, a second node controller, a first plurality of processors operatively coupled to the first node controller, a second plurality of processors operatively coupled to the second node controller, a first memory operatively coupled to the first node controller, a first input/output system operatively coupled to the first node controller, and an interprocessor communications network operatively coupled between the first node controller and the second node controller. In this embodiment, the first node controller includes: a crossbar unit, a memory port operatively coupled between the crossbar unit and the first memory, an input/output port operatively coupled between the crossbar unit and the first input/output system, a network port operatively coupled between the crossbar unit and the interprocessor communications network, and a plurality of independent processor ports, including a first processor port operatively coupled between the crossbar unit and a first subset of the first plurality of processors, and a second processor port operatively coupled between the crossbar unit and a second subset of the first plurality of processors. In some embodiments of the system, the first node controller is fabricated onto a single integrated-circuit chip.
In some embodiments of the system, the memory is packaged on a plurality of pluggable memory/directory cards wherein each card includes a plurality of memory chips including a first subset of memory chips dedicated to holding memory data and a second subset of memory chips dedicated to holding directory data. Further, the memory port includes a memory data port including a memory data bus and a memory address bus coupled to the first subset of memory chips, and a directory data port including a directory data bus and a directory address bus coupled to the second subset of memory chips. In some such embodiments, the ratio of (data space in the first subset of memory chips) to (data space in the second subset of memory chips) on each of the memory/directory cards is set to a value based on a size of the multiprocessor computer system.
In some embodiments of the system, the crossbar unit selectively combines two serially received doublewords of data into a single quadword micropacket for transmission through the crossbar unit, wherein each doubleword contains at least 64 bits of data and the single quadword contains at least 128 bits of data.
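The doubleword-to-quadword combination can be illustrated as simple bit packing. The following is a minimal sketch under stated assumptions: each doubleword is taken to be exactly 64 bits, the first-received doubleword is placed in the upper half of the quadword (the text above does not specify an ordering), and any header or control bits of the micropacket are ignored. The names `combine_doublewords` and `split_quadword` are hypothetical.

```python
DOUBLEWORD_BITS = 64                       # assumed exact width for this sketch
DOUBLEWORD_MASK = (1 << DOUBLEWORD_BITS) - 1


def combine_doublewords(first, second):
    """Pack two serially received 64-bit doublewords into one 128-bit value.

    Assumption: the first-received doubleword occupies the upper 64 bits.
    """
    for value in (first, second):
        if not 0 <= value <= DOUBLEWORD_MASK:
            raise ValueError("doubleword does not fit in 64 bits")
    return (first << DOUBLEWORD_BITS) | second


def split_quadword(quadword):
    """Recover the two doublewords in their original order of arrival."""
    first = (quadword >> DOUBLEWORD_BITS) & DOUBLEWORD_MASK
    second = quadword & DOUBLEWORD_MASK
    return first, second


quadword = combine_doublewords(0x0123456789ABCDEF, 0xFEDCBA9876543210)
print(hex(quadword))
print([hex(doubleword) for doubleword in split_quadword(quadword)])
```

Carrying both doublewords in one quadword means a single transfer through the crossbar replaces two, which is the effect the combination above is aiming for.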
Another aspect of the present invention provides a method usable with one or more of the above described systems. The method includes transmitting data between the memory port and the first processor port, between the memory port and the second processor port, between the memory port and the input/output port, and between the memory port and the network port.
Some embodiments of the method further include transmitting data directly between the first node controller and the second node controller that are directly connected to one another by the interprocessor communications network.
Some embodiments of the method further include transmitting data indirectly between the first node controller and the second node controller through a router chip that is also connected to one or more other node controllers.
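The difference between the direct and the indirect (routed) transmission paths can be illustrated with a small topology model. The following is a minimal sketch, not the routing method of the invention: it assumes the network is described as a list of bidirectional links and uses an ordinary breadth-first search to find a shortest path. The node names (`nc0`, `nc1`, `nc2`, `router0`) and the function `find_path` are hypothetical.

```python
from collections import deque


def find_path(links, source, destination):
    """Return a shortest path between two endpoints, or None if unreachable.

    `links` is a list of (a, b) pairs, each treated as a bidirectional link.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    previous = {source: None}              # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            path = []
            while node is not None:        # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Small system: two node controllers connected directly to one another.
direct_links = [("nc0", "nc1")]
# Larger system: node controllers reach each other through a router chip.
routed_links = [("nc0", "router0"), ("nc1", "router0"), ("nc2", "router0")]

direct_path = find_path(direct_links, "nc0", "nc1")
routed_path = find_path(routed_links, "nc0", "nc1")
print(direct_path)
print(routed_path)
```

In the two-controller configuration the path contains only the two node controllers, while in the routed configuration the same pair communicates through the shared router chip.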