The ever-expanding requirements for processing-intensive computer applications are driving the market to produce systems of ever-greater power. Unfortunately, improvements in processor technology, though impressive, are insufficient to satisfy all of this demand.
One alternative possibility for creating a system with increased power is to operate several closely coupled processing nodes in tandem. Though each node operates in its own local memory space, the close coupling necessitates a degree of memory sharing. This shared memory can be implemented as a single central copy, or (more typically) replicated and distributed in the nodes' local memory. Either way this gives rise to the need for a high bandwidth inter-node communication system, in the former case to provide access to the central memory, and in the latter case to ensure that the distributed copies are kept coherent.
A node generating traffic through this communication system will frequently require a reply to its request before processing can continue. Thus, either the node must suspend processing, or (where possible) it must switch to another task which is not so stalled. Either option will cost overall performance. Low latency in the inter-node communication system is therefore a prime requirement to minimize such loss.
In data communications systems, cell loss can be handled by higher layers in the protocol stack and can therefore be tolerated. By contrast, cell loss in processor interconnect systems is generally unacceptable due to the stalled requesting process, yet such systems typically operate with a minimum of protocol layers in order to keep down system latency. The physical layer must therefore implement a reliable delivery protocol in hardware.
In WO 00/38375, the disclosure of which is incorporated here in its entirety, we proposed a data switching apparatus which possesses inherent attributes of high bandwidth, scalability, low latency, small physical volume and low cost. Only limited details of this technology had been made publicly available by the priority date of the present application. It is illustrated in FIG. 1.
A switching system employs a number n+1 of routers, which may be bi-directional. The information transmission aspect of the respective routers is expressed as “ingress routers” ITM0, ITM1, . . . ITMn. The information receiving aspect of the routers is expressed as the n+1 “egress routers” ETM0, ETM1, . . . ETMn. Each ingress router receives information from one or more data sources (e.g. a set of processors constituting a “node”), e.g. ingress router ITM0 receives information from m+1 data sources ILE00, . . . , ILE0m. Similarly, each egress router sends information to one or more data outputs, e.g. egress router ETM0 sends information to data outputs ELE00, . . . ELE0m. The master device SC and matrix device(s) SW constitute the central interconnect fabric (CIF). Cells for transmission through the matrix SW are of equal length, and are each associated with a priority level. Each ingress router maintains, for each egress router and for each priority level, a respective “virtual output queue” of cells of that priority level for transmission to that egress router when the matrix device SW connects that ingress router to that egress router. Each ingress router sends connection requests to the master device SC. The master device SC determines which ingress and egress routers to connect by a first arbitration process. Each ingress router, having been informed of which egress router it will be connected to, performs a second arbitration to determine which priority level of cell it will transmit to that egress router. Having determined the priority level, it transmits the cell at the head of the virtual output queue for that priority level and that egress router to the matrix SW via the serial links, so that the cell arrives at the same time as the connection information sent directly from the master. In practice, the direct path from the master to the matrix is significantly quicker than the path via the router, so the connection information has to be artificially delayed in order to match the latency of the path via the router.
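The two-stage arbitration described above can be illustrated with a minimal software sketch. It is illustrative only: the class and function names (IngressRouter, master_arbitrate, switching_cycle) are hypothetical, and the policies used (a greedy first-come matching for the first arbitration, and strict priority with 0 as the highest level for the second) are assumptions, since the description above does not specify them. Link latency and the delaying of connection information are not modelled.

```python
from collections import deque


class IngressRouter:
    """Hypothetical model of one ingress router and its virtual output queues."""

    def __init__(self, router_id, num_egress, num_priorities):
        self.router_id = router_id
        # voq[egress][priority] is a FIFO of cells awaiting that egress router.
        self.voq = [[deque() for _ in range(num_priorities)]
                    for _ in range(num_egress)]

    def enqueue(self, egress, priority, cell):
        self.voq[egress][priority].append(cell)

    def requests(self):
        """Egress routers for which at least one cell is queued."""
        return [e for e, queues in enumerate(self.voq) if any(queues)]

    def select_cell(self, egress):
        """Second arbitration. Assumption: strict priority, 0 = highest."""
        for queue in self.voq[egress]:
            if queue:
                return queue.popleft()
        return None


def master_arbitrate(routers):
    """First arbitration. Assumption: greedy matching in router order.

    Returns a mapping {ingress router id: egress router index} in which
    each egress router is connected to at most one ingress router.
    """
    grants = {}
    taken = set()
    for router in routers:
        for egress in router.requests():
            if egress not in taken:
                grants[router.router_id] = egress
                taken.add(egress)
                break
    return grants


def switching_cycle(routers):
    """Run one cycle and return {egress router index: cell delivered}."""
    grants = master_arbitrate(routers)
    by_id = {router.router_id: router for router in routers}
    return {egress: by_id[ingress].select_cell(egress)
            for ingress, egress in grants.items()}


if __name__ == "__main__":
    itm0 = IngressRouter(0, num_egress=2, num_priorities=2)
    itm1 = IngressRouter(1, num_egress=2, num_priorities=2)
    itm0.enqueue(egress=1, priority=1, cell="a-low")
    itm0.enqueue(egress=1, priority=0, cell="a-high")
    itm1.enqueue(egress=1, priority=0, cell="b")
    itm1.enqueue(egress=0, priority=1, cell="c")
    for cycle in range(3):
        print(cycle, switching_cycle([itm0, itm1]))
```

In this sketch a cell leaves its virtual output queue only once a connection has been granted, which mirrors the arrangement in which all congestion buffering resides in the routers.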
In summary, the above system uses a memoryless fabric with all congestion buffering in the routers.
WO 94/17617 discloses a switch according to the preamble of claim 1. The switching matrix includes a buffer which is capable of storing only one cell for each path through the switching matrix which can be formed during a single switching cycle. Following a determination of which cells are to be transmitted through the switches, those cells leave the ingress routers, and are transmitted through the switching matrix. This passage includes temporary storage of the cells in the buffer as they pass through the switching matrix, at a time when they are no longer stored in the ingress routers.