The continued demand for high-performance computers and computer systems requires optimum use of the available hardware and software. One approach is to build systems from processing nodes, each comprising one or more microprocessors and memories. These computer systems are sometimes referred to as shared multiprocessor systems. In a shared multiprocessing computer system, the nodes are interconnected so that they can communicate with each other and share operating systems, resources, data, memory, etc.
One of the goals of building a modern computing machine for the enterprise is to have enough system capacity to take the many different workloads and applications running in a distributed computing environment, such as a server farm, and migrate them onto a large monolithic host server. Consolidating workloads and applications from many small machines onto a single larger one is financially motivated: it reduces the number of system operators, the amount of floor space, and system maintenance costs. System integration vendors have been pushing the SMP size envelope, integrating up to 64 or more processors in a tightly coupled shared memory system in a variety of coherent inter-processor connect topologies.
The commonly available designs on the Unix platform include topologies in which integrated processor-memory nodes, or simply nodes, are interconnected by means of multiple parallel common directional loops with a distributed switch network (topology A), a central crossbar switch (topology B), or a tree-based hierarchical switch (topology C). All of these topologies can be built to achieve the large scalability goal of a modern computing machine, but at the expense of lengthy node-to-node access latency, as measured in the number of node hops, which adversely affects system performance.
When a processor demands a unit of storage data that is not present in its node's internal cache system, the data request is broadcast out to snoop all other nodes in the system to locate the latest version of the data. This data request, or address broadcast snoop, traverses the entire topology to find every node and snoop its cache content for an address match. The collective snoop results are then combined and acted upon as an arbitration means by which the appropriate node is selected to source the data. A storage coherency scheme can be devised that will source data early from a node without waiting for the collective snoop results. If the same data exists in multiple nodes' caches, only one node sources the requested data.
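The broadcast, collect, and arbitrate sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the coherency scheme of any particular machine: the `Node` class, the `select_source` function, the per-line version numbers used to decide which copy is "latest", and the lowest-node-id tie-break are all hypothetical and do not come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A processor-memory node with a toy cache: address -> version number."""
    node_id: int
    cache: dict = field(default_factory=dict)

    def snoop(self, address):
        """Return the cached version for the address, or None on a miss."""
        return self.cache.get(address)


def select_source(requester_id, nodes, address):
    """Snoop every other node, combine the results, and pick one source node."""
    # Address broadcast snoop: every node except the requester is checked.
    results = {
        n.node_id: n.snoop(address)
        for n in nodes
        if n.node_id != requester_id
    }
    # Combine the collective snoop results: keep only the cache hits.
    hits = {nid: ver for nid, ver in results.items() if ver is not None}
    if not hits:
        return None  # no other cache holds the line; not handled in this sketch
    latest = max(hits.values())
    # Assumed tie-break: among holders of the latest version, lowest id wins,
    # so exactly one node sources the data.
    return min(nid for nid, ver in hits.items() if ver == latest)


# Nodes 1 and 2 both hold the latest copy of address 0x80; only one is chosen.
nodes = [Node(0), Node(1, {0x80: 3}), Node(2, {0x80: 3}), Node(3, {0x80: 1})]
print(select_source(0, nodes, 0x80))  # prints 1
```

The sketch models only the late data case, where the source is chosen after all snoop results are in. An early data scheme would let a hitting node respond before the combined result is known.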
When the described sequence for processing a data fetch request is implemented on a 4-node system, for example, the address snoop on topology A propagates around a ring, snooping every node in the process, and eventually circles back to the requesting node. The snoop results from each of the nodes are gathered back at the requesting node and then broadcast out on the ring to identify which node will source the data.
Again, a storage coherency scheme can be devised in which data existing in a node's cache is sourced on the initial snoop broadcast, without waiting for the collective snoop result broadcast. The access latency on topology A for the early data case, measured from snoop launch and assuming data routing is optimized for the shortest return path, is an average of 3.33 node-to-node crossings, or node hops. For the late data case, which relies on the collective snoop results, the average latency is 7.33 node hops.
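The 3.33 and 7.33 figures can be reproduced with a simple hop-counting model. The model below is an assumption, not something the text spells out: the snoop travels in one direction around the ring, the data is equally likely to sit in any of the other three nodes, the data returns by the shorter of the two ring directions, and in the late case the snoop first completes a full circuit before the result broadcast reaches the sourcing node. The function name `ring_latencies` is hypothetical.

```python
from fractions import Fraction


def ring_latencies(num_nodes=4):
    """Return (early, late) average latency in node hops for a one-way ring."""
    total = 0
    for distance in range(1, num_nodes):
        # Hops for the snoop (or result broadcast) to reach the sourcing node.
        outbound_hops = distance
        # Assumed: data takes the shorter direction back to the requester.
        return_hops = min(distance, num_nodes - distance)
        total += outbound_hops + return_hops
    early = Fraction(total, num_nodes - 1)
    # Assumed late case: one full snoop circuit, then the same pattern again
    # for the result broadcast and the data return.
    late = num_nodes + early
    return early, late


early, late = ring_latencies(4)
print(f"early: {float(early):.2f} hops, late: {float(late):.2f} hops")
```

For four nodes the three possible source positions cost 2, 4, and 4 hops in the early case, which averages to 10/3, and the late case adds the 4-hop snoop circuit on top of that.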
In topology B, the fetch request is launched to the central crossbar switch and from there it is broadcast to the other 3 nodes. The snoop results from the nodes are then collected on the central crossbar switch and broadcast out to all the nodes. The calculated average early data latency in topology B is therefore 4 node hops, treating the node-to-central-crossbar-switch crossing as a node hop, and the average late data latency is 6 node hops.
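The crossbar figures follow from adding up one hop per crossing between a node and the central switch. The breakdown below is a sketch of that arithmetic; the step names and the function `crossbar_latencies` are hypothetical labels, and the assumption that the data itself returns through the switch in two hops is inferred from the stated totals rather than given in the text.

```python
def crossbar_latencies():
    """Return (early, late) latency in node hops for a central crossbar."""
    request_to_switch = 1    # requester -> central crossbar
    snoop_to_nodes = 1       # crossbar -> other nodes (broadcast)
    results_to_switch = 1    # nodes -> crossbar (snoop results collected)
    results_to_nodes = 1     # crossbar -> all nodes (combined result)
    data_to_switch = 1       # sourcing node -> crossbar (assumed data path)
    data_to_requester = 1    # crossbar -> requester (assumed data path)

    # Early data: the sourcing node responds right after seeing the snoop.
    early = (request_to_switch + snoop_to_nodes
             + data_to_switch + data_to_requester)
    # Late data: the sourcing node waits for the combined snoop result.
    late = (request_to_switch + snoop_to_nodes
            + results_to_switch + results_to_nodes
            + data_to_switch + data_to_requester)
    return early, late


print(crossbar_latencies())  # prints (4, 6)
```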
In a tree-based hierarchical topology such as topology C with 4 nodes, the optimal arrangement would resemble topology B and would therefore have the same latency. A taller tree-based hierarchy would lengthen the early and late data latencies by 2 node hops for each switch level that is added.
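The scaling rule for taller trees can be written as a short calculation. This sketch takes the stated increment of 2 node hops per added switch level at face value and assumes a single-level tree matches the 4-hop and 6-hop figures of topology B; the function name `tree_latencies` and the parameter `switch_levels` are hypothetical.

```python
def tree_latencies(switch_levels=1):
    """Return (early, late) latency in node hops for a tree of switches."""
    if switch_levels < 1:
        raise ValueError("a tree needs at least one switch level")
    base_early, base_late = 4, 6   # single level: same as topology B
    extra = 2 * (switch_levels - 1)  # 2 more hops per added switch level
    return base_early + extra, base_late + extra


for levels in (1, 2, 3):
    print(levels, tree_latencies(levels))
```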
Accordingly, it is desirable to provide a bus protocol on a nodal interconnect topology that allows for both high overall system performance and availability.