In the field of information processing, it is well known to cache data locally, since access to a cache can be significantly faster than access to a main memory location. Information processing systems will typically cache data which is frequently accessed and/or data that is expected to be required next by a processing unit, based on related data most recently requested by the processing unit. When the processing unit requires data, yet is unable to access that data from a local cache, the event is referred to as a “cache miss”. Performance analysis work indicates that cache misses in performance-critical paths hurt the overall performance of the system, posing both latency and throughput issues. Mechanisms that avoid cache misses in the critical paths are therefore important.
Present-day memory subsystems consist of a hierarchy of memories of differing speeds and sizes. There are multiple levels of caches, with the caches becoming smaller and faster the closer they are to the processor, followed by a main memory that has significantly higher access latency and lower throughput. The processor may prefetch a memory location to its cache based on anticipated processing needs. When such a prefetch occurs, the address of the memory location can be found in the cache. Alternatively, the address of a memory location may be found in a cache based on the processor having recently accessed the memory location directly. In either instance, the presence of the address in the processor's cache indicates that the processor is “interested” in the memory location and is, accordingly, interested in any updates to that memory location. Any updates to the memory location are written to the address of the memory location in the main memory, but should also, preferably, be reflected at any cache location that holds the memory address.
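By way of illustration only, the hierarchy and prefetch behavior described above can be sketched as a toy model. The class name `Hierarchy`, the two cache levels, and the access costs are hypothetical; the costs are arbitrary units chosen for the example and are not measurements of any real system.

```python
# Illustrative access costs in arbitrary units (assumed, not measured).
LEVELS = [("L1", 1), ("L2", 10), ("main memory", 100)]


class Hierarchy:
    """Toy two-level cache hierarchy in front of a main memory."""

    def __init__(self):
        # Each cache level is modeled as the set of addresses it holds.
        self.caches = {"L1": set(), "L2": set()}

    def _fill(self, addr):
        # Place the address in every cache level.
        for held in self.caches.values():
            held.add(addr)

    def prefetch(self, addr):
        """Bring an address into the caches ahead of its first use."""
        self._fill(addr)

    def access(self, addr):
        """Return (level that served the access, accumulated cost)."""
        cost = 0
        for name, latency in LEVELS:
            cost += latency  # each level probed adds its latency
            if name == "main memory" or addr in self.caches[name]:
                self._fill(addr)
                return name, cost


h = Hierarchy()
print(h.access(0x40))  # first touch misses both caches
print(h.access(0x40))  # repeated access is served by the closest cache
h.prefetch(0x80)
print(h.access(0x80))  # a prefetched address hits on first use
```

In this sketch a miss pays the cost of every level it passes through, which is why a prefetched or recently accessed address is much cheaper to reach than one found only in main memory.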
Further, in multi-processor nodes, each processor may have one or more dedicated levels of caches, any one of which may have previously-read data from a shared memory location. If the data has been updated at the shared memory location, the cache may include stale data. Cache coherence must be maintained, ensuring that two processors that share memory get coherent data and that an update to shared memory by one processor is reflected in future accesses by other processors.
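The stale-data problem can be illustrated with a minimal sketch, assuming a simple write-invalidate scheme in which a write by one processor removes the copies held by other registered caches. The names `SharedMemory`, `Cache`, and `demo_coherence` are hypothetical, and real coherence protocols are considerably more elaborate.

```python
class SharedMemory:
    """Toy shared memory that knows which caches take part in coherence."""

    def __init__(self):
        self.data = {}
        self.caches = []

    def write(self, addr, value, writer=None):
        self.data[addr] = value
        # Write-invalidate: every other participating cache drops its copy.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)


class Cache:
    """Toy per-processor cache; coherent=False models a cache left out."""

    def __init__(self, memory, coherent=True):
        self.memory = memory
        self.lines = {}
        if coherent:
            memory.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:  # cache miss: fetch from shared memory
            self.lines[addr] = self.memory.data[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory.write(addr, value, writer=self)


def demo_coherence():
    mem = SharedMemory()
    mem.data[0x10] = 1
    writer = Cache(mem)
    coherent_reader = Cache(mem, coherent=True)
    stale_reader = Cache(mem, coherent=False)
    coherent_reader.read(0x10)  # both readers cache the old value
    stale_reader.read(0x10)
    writer.write(0x10, 2)       # update by another processor
    return coherent_reader.read(0x10), stale_reader.read(0x10)


print(demo_coherence())
```

The reader that participates in invalidation observes the updated value on its next access, while the reader left out of the scheme continues to return the value it cached before the update.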
In the shared memory model, the processors may share one memory subsystem that performs data transfers and cache coherence. Alternatively, each processor in a multi-processor node may have its own associated memory subsystem. In a multi-node system or network, each node will generally have its own memory subsystem including an adapter for exchanging data with other nodes and for updating memory locations with data received from local processors or other nodes. FIG. 1 provides a block diagram of a multi-node system, including Node1 at 100 having processor at 102, processor cache locations 103-106, main memory location 110 and adapter/memory subsystem 109. Node2 at 120 is a multi-processor node having processor 122 with associated processor cache locations 123-126, processor 142 having processor cache locations 143-146, shared memory 130 and adapter/memory subsystem 129. The adapter 109 at Node1 can send data across network 150 to adapter 129 at Node2.
There are multiple memory transfer paths where current memory subsystems provide less than optimal performance and may cause significant overhead. One such path is in interprocessor communications among processors at different nodes across the network, wherein a source processor on a first node sends data to a target processor on a second node. With reference to FIG. 1, assume data has been updated by processor 102 on Node1. The data is provided by source processor 102 to the Node1 adapter 109 to be communicated to a target processor 122 on Node2. In many instances, the target processor 122 may be waiting for the data to arrive and may be polling on the memory location, for example cache line 123, where it expects updated data to be placed. Because the memory location is being repeatedly accessed, it is likely that the address of the memory location is in the target processor's cache. When the data arrives on the target node, the target node's memory subsystem, the adapter 129 as illustrated in FIG. 1, can write it directly to memory 130 using a Direct Memory Access (DMA) engine (not illustrated).
However, due to the memory subsystem's coherency protocol, the write to the memory location in memory 130 automatically results in an invalidation of the location in the target processor's cache 123. The polling target processor 122 then needs to go to the main memory 130 again to fetch the data written by the memory subsystem (adapter 129). The foregoing process, involving a step of a DMA to memory by the adapter, a step of the memory subsystem invalidating the cache, and a step of fetching the new data from memory by the target processor, can result in performance degradation and delays in the communication path, thereby increasing the latency of inter-node communications significantly.
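The three-step receive path described above can be sketched as a small event-counting model. This is a simplified illustration under stated assumptions: the class `TargetNode`, its method names, and the address and payload values are hypothetical, and the model counts events rather than measuring time.

```python
class TargetNode:
    """Toy model of a target node: main memory, one processor cache,
    and an adapter that writes to memory by DMA."""

    def __init__(self):
        self.memory = {}
        self.cache = {}
        self.memory_fetches = 0   # trips by the processor to main memory
        self.invalidations = 0    # cache lines dropped by the coherency step

    def poll(self, addr):
        """Processor polls a location, going to memory only on a miss."""
        if addr not in self.cache:
            self.memory_fetches += 1
            self.cache[addr] = self.memory.get(addr)
        return self.cache[addr]

    def adapter_dma_write(self, addr, value):
        """Step 1: DMA to memory. Step 2: invalidate any cached copy."""
        self.memory[addr] = value
        if addr in self.cache:
            del self.cache[addr]
            self.invalidations += 1


def receive_path_demo():
    node = TargetNode()
    flag = 0x80                       # hypothetical polled address
    node.memory[flag] = 0
    for _ in range(3):                # repeated polling hits the cache
        node.poll(flag)
    node.adapter_dma_write(flag, 42)  # data arrives from the network
    value = node.poll(flag)           # step 3: refetch from main memory
    return value, node.memory_fetches, node.invalidations


print(receive_path_demo())
```

In this model the arrival of a single message costs the polling processor one invalidation and one additional fetch from main memory before the new data becomes visible, which corresponds to the added latency attributed to this path.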
It is, therefore, an object of the present invention to provide a system and mechanism for efficient data transfer between processors.
It is another object of the invention to provide efficient cache injection for user space message passing of messages of varying size and latency.