The invention relates to a bridge for coupling a requesting interconnect and a serving interconnect connected to a number of coherent units in a computer system. Moreover, the present invention relates to a method and to a computer program for coupling a requesting interconnect and a serving interconnect connected to a number of coherent units in a computer system.
The present bridge is configured to provide a load/store path for inbound requests between interconnects with ordering requirements. For example, the bridge may be arranged between an I/O bus, like PCI Express, as a requesting interconnect on a requesting side (also called south) and a serving interconnect, e.g., a snooping-based coherent interconnect, on a serving side (also called north).
I/O devices and I/O buses, like PCI Express, impose strong ordering requirements; in particular, a read request may not pass write requests that were issued ahead of it. As a result, the read request is guaranteed not to receive stale information if it accesses the same data that was modified by a previous write request.
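The ordering rule described above can be illustrated with a minimal sketch. The `Request` type and the `read_may_proceed` helper are hypothetical names introduced only for this illustration, and the model is simplified in that it holds a read behind every earlier incomplete write, regardless of address.

```python
from dataclasses import dataclass


@dataclass
class Request:
    kind: str          # "read" or "write" (hypothetical encoding)
    address: int
    done: bool = False


def read_may_proceed(queue, index):
    """Return True if no write issued ahead of queue[index] is still pending."""
    return all(r.done for r in queue[:index] if r.kind == "write")


# A write followed by a read to the same address.
queue = [Request("write", 0x1000), Request("read", 0x1000)]
print(read_may_proceed(queue, 1))  # False: the earlier write is still pending
queue[0].done = True
print(read_may_proceed(queue, 1))  # True: the read can no longer see stale data
```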
In this regard, FIG. 1 shows a schematic block diagram of an example of a bridge 10 coupling an I/O device 200 and a coherent processor interconnect 300. The coherent processor interconnect 300 couples a plurality of processing units 401 to 404 and a memory controller 500, for instance.
The coherent processor interconnect 300 may be a snooping-based coherent interconnect which may include the possibility for a request (command) to be retried. The necessity of a retry may be caused by missing resources in the coherent units 401-404, 500 that are attached to the coherent processor interconnect 300 and potentially responsible for handling the request, e.g., when all the request queues of the memory controller 500 are already taken by other requests. A retry may also be caused by address conflicts, namely when a request for the same address is already being processed in the coherent processor interconnect 300 and the address is protected against other operations of the coherent units 401-404, 500, 100 involved in the transfer.
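The two retry causes named above, missing resources and address conflicts, can be sketched as a simple decision function. The function name, its parameters and the returned strings are assumptions made for illustration; they do not describe the response encoding of any particular interconnect.

```python
def snoop_response(address, free_queue_slots, addresses_in_flight):
    """Decide whether a request is accepted or must be retried (hypothetical model)."""
    if address in addresses_in_flight:
        # The address is protected by a transfer that is already in progress.
        return "retry: address conflict"
    if free_queue_slots == 0:
        # E.g., all request queues of the memory controller are taken.
        return "retry: no resources"
    return "accept"


print(snoop_response(0x2000, free_queue_slots=4, addresses_in_flight={0x1000}))
print(snoop_response(0x1000, free_queue_slots=4, addresses_in_flight={0x1000}))
print(snoop_response(0x2000, free_queue_slots=0, addresses_in_flight=set()))
```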
Depending on the implementation of the logic of the bridge 10 attached to the I/O device 200, which may also be called the south interface, the responses returned for load requests from the south interface 200 may also require retries when the logic of the bridge runs out of buffer space, e.g., because of delayed credit returns between the I/O bridge 11 and the I/O host stack 12.
Moreover, a bridge as shown in FIG. 1 for handling loads or writes from the I/O device 200 may have to support strong ordering requirements of write requests and also read requests. The read and write requests (load and store requests) are received by the bridge at its south interface.
In particular, for good performance, it may be critical that the read requests are kept in order as well as possible in order to avoid head-of-line blocking for the southbound read responses. For example, in PCI Express, data responses of maximum transfer unit (MTU) size need to be returned in order. This means that, for example, for a 4 kB read request and a read response MTU of 256 B, 16 response packets are created, requiring 64 or 32 reads on the coherent interconnect, depending on the cache line size, which is typically 64 or 128 bytes, respectively.
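The arithmetic of this example can be reproduced with a short sketch. The function `response_plan` is a hypothetical helper that assumes the request size is an exact multiple of the MTU and the MTU an exact multiple of the cache line size.

```python
def response_plan(request_bytes, mtu_bytes, cache_line_bytes):
    """Return (response packets, interconnect reads, cache lines per packet)."""
    packets = request_bytes // mtu_bytes
    reads = request_bytes // cache_line_bytes
    lines_per_packet = mtu_bytes // cache_line_bytes
    return packets, reads, lines_per_packet


# 4 kB read request, 256 B response MTU, 64 B or 128 B cache lines.
print(response_plan(4096, 256, 64))   # (16, 64, 4)
print(response_plan(4096, 256, 128))  # (16, 32, 2)
```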
Any cache line that is returned on the southbound interface ahead of the 2 or 4 cache lines required for assembling the first response packet, and that is therefore already available while it cannot yet be sent, incurs additional latency and blocks buffers in the southbound interconnect from being reused for new requests.
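The buffer cost of out-of-order arrival can be illustrated with a minimal sketch. It assumes a simplified model in which each cache line is released southbound individually and strictly in request order, rather than per assembled response packet; `peak_buffered` is a hypothetical helper name.

```python
def peak_buffered(arrival_order):
    """Return the peak number of cache lines held while waiting for earlier lines.

    arrival_order lists cache-line indices (0 = first line of the request)
    in the order in which the coherent interconnect returns them.
    """
    waiting = set()
    next_to_send = 0
    peak = 0
    for line in arrival_order:
        waiting.add(line)
        peak = max(peak, len(waiting))
        # Release every line that is now next in request order.
        while next_to_send in waiting:
            waiting.remove(next_to_send)
            next_to_send += 1
    return peak


print(peak_buffered([0, 1, 2, 3]))  # 1: in-order arrival needs one buffer
print(peak_buffered([3, 2, 1, 0]))  # 4: every line waits for line 0
```

In this model, the later the first cache line arrives, the more buffers stay occupied by data that cannot yet be forwarded.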
Without the possibility of retries and without different response latencies in the coherent interconnect, a simple FIFO (First-In-First-Out) implementation may be used. However, as there can be any combination of varying latencies and potential retries, a FIFO implementation that can keep the optimal scheduling order becomes too complex to implement as the number of machines increases. Another option may be to use bit vectors for tracking the ordering between all machines. This implementation, however, scales exponentially with the number of active machines (also called instantiated machines), which makes it prohibitive to implement given the increasing bandwidth requirements.
U.S. Pat. No. 7,996,625 B2 describes a method for reducing memory latency in a multi-node architecture. A speculative read request is issued to a home node before results of a cache coherence protocol are determined. The home node initiates a read to memory to complete the speculative read request. Results of a cache coherence protocol may be determined by a coherence agent to resolve cache coherency after the speculative read request is issued.
U.S. Pat. No. 7,600,078 B1 describes a method for speculatively performing read transactions. The method includes speculatively providing a read request to a memory controller associated with a processor, determining coherency of the read request in parallel with obtaining data of the speculatively provided read request, and providing the data of the speculatively provided read request to the processor if the read request is coherent. In this way, data may be used by a processor with a reduced latency.
Accordingly, it is an aspect of the present invention to improve bridging between a requesting interconnect, like an I/O bus, and a serving interconnect, like a processor interconnect.