1. Technical Field
The present invention relates generally to processors and computing systems, and more particularly, to a cache injection mechanism to accelerate input/output (I/O) operations in multiprocessor systems.
2. Description of the Related Art
Present-day multiprocessor systems are gravitating toward a truly distributed architecture. As opposed to the symmetrical multi-processing (SMP) or uniform memory access (UMA) model, in which memory and devices are typically available to and accessible by all processors within the processing system at a similar latency, non-uniform memory access (NUMA) systems generally include localized memory and devices that are either not readily accessible to all processors in the system or have a high performance cost due to higher latency. As such, when a direct memory access (DMA) transfer is made between one memory and another memory or I/O device, the memory image of the transfer may not be located such that the processor that processes the transferred data can quickly access the DMA memory image, as the DMA transfer target memory (the memory to which the transfer is made) may be located remotely from the above-identified processor. Data transferred into memory via DMA typically requires some processing, such as in systems where network packets are received by a network interface. Network packet processing typically requires multiple processing operations before the data is consumed by its end-point process. For example, the packets might be examined by an initial interrupt handler, followed by some communication protocol processing before reformatted packets are finally delivered to the process responsible for using the data included in the packets. Each of these operations accesses the transferred data and may therefore incur cache misses. Since cache misses require additional time to be handled, it would be desirable to provide methods to reduce cache misses due to various I/O operations in a multiprocessor system.
Further, in multiprocessor systems in general, cache memories (caches) are typically tightly coupled to a single processor or a small group of processors and supply the processor(s) with data from a system memory that may or may not be symmetrically available to each processor. In some multiprocessor systems, a central system memory is not present at all, and the entire memory consists of the collection of local memories of all of the processors. While each processor may access remote memories located in other processing nodes, such accesses are typically conducted through bus bridges or high-speed network interconnects for which access time to the memory is greatly increased over the access time for memory that is local or tightly-coupled to the accessing processor.
U.S. Pat. No. 6,711,650, the specification of which is herein incorporated by reference, discloses a mechanism for accelerating the processing of data transferred by a DMA transfer. The acceleration technique is known as “cache injection”. Data is transferred or “injected” into a cache during the DMA transfer into system memory, thus reducing or eliminating the delay associated with subsequently loading the data into cache for use by the processor. By directly loading existing cache lines that would otherwise be invalidated by the DMA write to the associated blocks of memory, the cache lines affected do not have to be marked invalid, thus avoiding the cache miss penalties that would otherwise occur and eliminating the need to reload the cache lines in response to the miss. Further, the above-described cache injection technique can avoid a cache load operation when space is available for allocation of new cache lines for DMA transfer locations that are not yet mapped into the cache. Also, when a cache line to be injected is not present in the cache and either space is unavailable or the cache controller is unable to allocate new lines for DMA transfer locations that are not already mapped, the controller need take no action; standard DMA transfer processing takes place and the memory is guaranteed to have the most up-to-date copy of the data.
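The three cases described above (update of an existing line, allocation of a new line, and fall-back to a plain DMA write) can be illustrated with a minimal sketch. This is a toy software model for illustration only, not the mechanism of the above-incorporated patent; the names ToyCache, can_allocate and dma_write_with_injection are hypothetical, and the cache is assumed to be a simple fully associative map from line address to data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCache:
    """Hypothetical toy cache: a map from line address to line data."""
    capacity: int                      # maximum number of lines held
    can_allocate: bool = True          # whether the controller may allocate new lines
    lines: dict = field(default_factory=dict)


def dma_write_with_injection(memory, cache, addr, data):
    """Model one DMA line write with cache injection.

    Returns a label naming which of the three cases was taken.
    """
    # The DMA transfer always updates memory, so memory holds
    # the most up-to-date copy in every case.
    memory[addr] = data

    # Case 1: the line is already cached. Update it in place
    # instead of invalidating it.
    if addr in cache.lines:
        cache.lines[addr] = data
        return "updated"

    # Case 2: the line is not cached, but space is available and
    # the controller is able to allocate a new line.
    if cache.can_allocate and len(cache.lines) < cache.capacity:
        cache.lines[addr] = data
        return "allocated"

    # Case 3: no injection; ordinary DMA processing only.
    return "memory_only"
```

In this model, a later read of addr by the processor hits in the cache for the first two cases and falls back to memory only in the third.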
Cache injection is thus very useful in uniprocessor systems, as the time required to process DMA operations is reduced overall, thus improving I/O device operations and operations where DMA hardware is used to transfer memory images to other memories. The cache injection occurs while the DMA transfer is in progress, rather than occurring after a cache miss when the DMA transfer completion routine (or other subsequent process) first accesses the transferred data.
However, using typical cache injection techniques, such as that described in the above-incorporated patent, in a multiprocessor system such as an SMP or NUMA system presents additional challenges. In any multiprocessor environment, the cache loaded by the cache injection technique may not be located proximate the processor that executes the DMA transfer completion routine or other routine that operates on or examines the transferred data. In a NUMA system, the memory image from the DMA transfer may not be in a memory that is quickly accessible to the processor that consumes or processes the transferred data. For example, if the data is transferred to the local memory of another processor, accesses to the memory address ranges corresponding to that “remote” memory (i.e., the local memory of the other processor) would typically require transfer via a high-speed interconnect network or through a bus bridge, increasing the time required to access the data for processing.
Further, the processor cache or outboard cache of the processor that eventually processes the DMA-transferred data might not be loaded by the cache injection mechanism described in the above-incorporated patent, or, if a cache was loaded, it could be a cache coupled to the wrong processor (i.e., a processor that does not execute the DMA transfer completion routine).
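The placement problem described above can be made concrete with a small classification sketch. It is a simplified illustration under stated assumptions: the function access_path and its node-number parameters are hypothetical, each processing node is assumed to have one cache and one local memory, and cache-to-cache transfers between nodes are ignored.

```python
def access_path(consumer_node, injected_cache_node, memory_node):
    """Classify how the consuming processor reaches DMA-transferred data.

    consumer_node:       node whose processor runs the completion routine
    injected_cache_node: node whose cache received the injected data,
                         or None if no cache was loaded
    memory_node:         node whose local memory holds the DMA memory image
    """
    # Injection helped only if it targeted the consumer's own cache.
    if injected_cache_node == consumer_node:
        return "local cache hit"

    # Cache miss, but the memory image is local to the consumer.
    if memory_node == consumer_node:
        return "cache miss, local memory access"

    # Cache miss and the image lives in another node's memory, so the
    # access crosses a bus bridge or interconnect.
    return "cache miss, remote memory access"


# Data injected into node 0's cache and written to node 0's memory,
# but consumed on node 1:
print(access_path(consumer_node=1, injected_cache_node=0, memory_node=0))
```

Under these assumptions, injection into the wrong cache leaves the consuming processor no better off than if no injection had occurred, and a remote memory image adds interconnect latency on top of the miss.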
It is therefore desirable to provide a method and system that can manage cache injection in a multiprocessor system. It would further be desirable to provide such a method and system that can manage cache injection in both UMA and NUMA multiprocessor systems.