A computer system typically includes one or more central processor units (CPUs), also referred to simply as processors. Processors are well known to those of ordinary skill in the art. One known technique used to improve CPU performance involves the use of caches. Two types of caches are typically used. The first type of cache is known as an L1 (Level-1) cache, which typically resides in the processor core and improves performance by providing speedy access to data and instructions resident in the L1 cache(s). This saves the processor core from having to access main memory to obtain the desired data. L1 caches are typically small in order to achieve the desired speed. The second type of cache is known as an L2 (Level-2) cache. The L2 cache is larger and slower than the L1 cache, with a longer access time, while still being smaller and faster than main memory, with a shorter access time. Since the L2 cache is faster than main memory, the L2 cache saves the processor from having to access main memory if the desired data is not in the L1 cache but is resident in the L2 cache. Such cache hierarchies are not limited to two levels but can be extended with larger and slower Level 3 caches, Level 4 caches and so on.
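The benefit of the two-level hierarchy described above can be illustrated with the standard average access time calculation. The following is a minimal sketch only; the function name and all latency and miss-rate figures are hypothetical values chosen for illustration and do not come from any particular processor.

```python
def average_access_time(l1_time, l1_miss_rate, l2_time, l2_miss_rate, memory_time):
    """Average access time, in cycles, for a two-level cache hierarchy.

    All parameters are illustrative assumptions, not measured values.
    """
    # Cost paid on an L1 miss: look in L2, then go to main memory on an L2 miss.
    l1_miss_cost = l2_time + l2_miss_rate * memory_time
    return l1_time + l1_miss_rate * l1_miss_cost


# Hypothetical figures: 4-cycle L1, 12-cycle L2, 200-cycle main memory.
with_l2 = average_access_time(4, 0.1, 12, 0.5, 200)
# Modeling "no L2" as an L2 that costs nothing and always misses.
without_l2 = average_access_time(4, 0.1, 0, 1.0, 200)
```

With these assumed figures the average access time with the L2 cache is approximately 15.2 cycles, against approximately 24 cycles when every L1 miss goes to main memory.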
Another known technique for improving CPU performance involves a mechanism referred to as a push cache. Push cache is an architectural feature that allows devices or other processors to utilize cache push operations to push or write data directly into one or more of the CPU's caches while maintaining coherency with the main memory. The processor utilizes the cache data and avoids costly memory accesses, thereby increasing the overall performance of the system. Devices designed to push selected data into a processor's cache(s) perform the push operation irrespective of the effect of the operation on the overall performance of the system. Thus, in certain scenarios, use of the cache push mechanism may significantly degrade system performance, for example by displacing from the cache previously pushed data that a program running on the CPU has not yet consumed. If, for example, the data is stored in a data structure that is accessed in First In/First Out (FIFO) order, then the displaced data will be moved back into the cache and accessed before the most recently pushed data is accessed, thus causing unnecessary delays and memory traffic.
Referring now to FIG. 1, a prior art system 10 supporting cache push operations is shown. A device 60 issues a write to memory 40 with control bits set to indicate that it is a cache push write operation. The bridge 50 updates the L2 cache 30. The L2 cache 30 updates the memory 40 later in order to maintain coherency between the memory and the L2 cache. Any data displaced or victimized from the L2 cache 30 as a result of the push operation is discarded or written by the L2 cache 30 to memory 40 as necessary in order to maintain coherency in the memory image. Any processor 20 accesses to the updated data would find the data available in the L2 cache 30, thereby avoiding a cache miss by the processor core in L2 cache 30. Alternatively, the push operation updates both the L2 cache and main memory together. As should be clear to one skilled in the art, the memory 40 might be a next-level cache accessed in common between the CPU and the device.
While the cache push mechanism offers significant performance gains, the efficiency of the cache push mechanism is dependent on two factors. A first factor is the timeliness of the data pushed. A second factor is the cost of victimizing other L2 cache entries as a result of a cache push operation.
The timeliness of the data pushed comes into effect in certain scenarios. In a cache push mechanism system, cache misses are reduced by proactively placing data into the L2 cache 30 thereby avoiding main memory accesses. However, if pushed cache data does not get accessed soon enough, the pushed cache data could end up being a victim of cache replacement, thereby nullifying any gain achieved by the cache push operation and instead incurring additional bus traffic due to the displacement of the pushed cache data and potentially also due to an update to memory with the cache data that was initially displaced by the push operation. Such cache replacement may occur due to processor demand wherein the CPU requests data which is not currently in the cache and the line containing the pushed cache data is displaced by it, or can also occur when another cache push operation occurs and previously pushed cache data is displaced by that act.
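The timeliness effect described above can be reproduced with a toy model. The sketch below assumes a small fully associative cache with least-recently-used (LRU) replacement, which is an assumption made for illustration; the class and function names are hypothetical and the source does not specify a replacement policy. A device pushes a burst of entries, and a consumer then reads them in FIFO order.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def _install(self, addr):
        # Insert or refresh the line, then evict the least recently used one.
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)

    def push(self, addr):
        """Device-initiated cache push."""
        self._install(addr)

    def read(self, addr):
        """Processor read; returns True on a hit. A miss refills the line."""
        hit = addr in self.lines
        self._install(addr)
        return hit


def consumer_hits(capacity, burst):
    """Push `burst` entries, then read them back in FIFO order; count hits."""
    cache = LRUCache(capacity)
    for addr in range(burst):
        cache.push(addr)
    return sum(cache.read(addr) for addr in range(burst))
```

In this model, `consumer_hits(4, 4)` finds every pushed entry still cached, while `consumer_hits(4, 8)` finds none: the oldest entries were displaced by later pushes, and each refill of a displaced entry in turn displaces a newer pushed entry before the consumer reaches it.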
In prior art system 10 the peripheral device 60 might be configured to deliver received network packets into buffers in memory 40 along with packet metadata into descriptors also in memory 40, with both the packet data and the associated descriptors to be accessed by a network device driver running on the processor 20. A typical communications mechanism between a network interface such as gigabit Ethernet engine 60 and a processor 20 is one or more FIFOs implemented as data structures in memory 40, where the network interface writes to the tail of the FIFO and the processor 20 reads from the head of the FIFO. In a system implementing cache push capability the network interface might further be configured to push packet descriptors and some or all of the contents of the packet buffer to the cache(s) on one or more CPUs.
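The FIFO arrangement described above, in which the network interface writes at the tail and the processor reads from the head, can be sketched as a fixed-size ring. This is a minimal illustration only; the class name, method names and the drop-on-full behavior are assumptions and do not describe any specific device or driver.

```python
class DescriptorRing:
    """Toy descriptor FIFO shared by a device (producer) and a driver (consumer)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0   # next slot the driver reads
        self.tail = 0   # next slot the device writes
        self.count = 0  # descriptors currently queued

    def device_write(self, descriptor):
        """Device side: append at the tail. Returns False if the ring is full."""
        if self.count == len(self.slots):
            return False  # assumed policy: the packet is dropped
        self.slots[self.tail] = descriptor
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def driver_read(self):
        """Driver side: remove from the head. Returns None if the ring is empty."""
        if self.count == 0:
            return None
        descriptor = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return descriptor
```

Because the driver always consumes the oldest descriptor first, the entries it needs next are exactly the ones that have been resident longest, which is why the ordering matters for the cache push behavior discussed in this section.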
The peripheral device 60 may be provided as a gigabit Ethernet engine and the processor 20 may be a network processor. A gigabit Ethernet engine used with a network processor 20 typically implements an interrupt moderation scheme to ensure efficient packet processing at high packet rates. The interrupt moderation scheme ensures that the inter-interrupt interval increases as the packet rate increases. These schemes, while reducing the interrupt rate, also increase the number of packets and descriptors accumulated per interrupt. In turn, the potential for a pushed descriptor to be victimized increases, because the number of descriptors pushed between interrupts is correspondingly higher.
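The relationship between packet rate, inter-interrupt interval and accumulated descriptors can be made concrete with simple arithmetic. The sketch below is illustrative only: the function names, the one-descriptor-per-packet assumption and the example rates, intervals and cache capacity are all hypothetical.

```python
def descriptors_per_interrupt(packet_rate_pps, interval_us):
    """Descriptors accumulated between interrupts, assuming one per packet.

    packet_rate_pps: packets per second; interval_us: inter-interrupt
    interval in microseconds. Integer arithmetic avoids rounding noise.
    """
    return packet_rate_pps * interval_us // 1_000_000


def pushed_descriptors_at_risk(packet_rate_pps, interval_us, cache_capacity):
    """True if one interrupt's worth of pushed descriptors exceeds the
    number of descriptors the cache is assumed able to hold."""
    return descriptors_per_interrupt(packet_rate_pps, interval_us) > cache_capacity


# Hypothetical operating points: the interval grows with the packet rate.
low_load = descriptors_per_interrupt(100_000, 50)
high_load = descriptors_per_interrupt(1_000_000, 200)
```

With these assumed figures, a tenfold increase in packet rate combined with a fourfold longer interval raises the descriptors accumulated per interrupt from 5 to 200, so a cache able to hold 64 descriptors would be adequate at the low operating point and exceeded at the high one.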
In normal operation of such a network interface and related device driver, the queue in memory provides for elasticity, or “smoothing” of the arrival rate of packets; the device driver does not have to keep up with the arrival of packets on a packet-by-packet basis but rather only needs to budget such that the processing time on average does not exceed the inter-arrival time on average of packets. Victimization of pushed but unaccessed packets and descriptors from the head of a queue due to subsequent cache push operations to the tail of that queue occurs when the cache is not dynamically able to contain the required depth of elasticity at a particular point in a bursty arrival of packets.
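The elasticity requirement can be illustrated by tracking queue depth over a bursty arrival pattern. The following is a minimal sketch under stated assumptions: time is divided into equal ticks, the driver services a fixed number of packets per tick, and the function name and arrival pattern are hypothetical.

```python
def peak_queue_depth(arrivals_per_tick, serviced_per_tick):
    """Largest queue depth reached when arrivals are enqueued at the start
    of each tick and the driver then dequeues up to `serviced_per_tick`."""
    depth = 0
    peak = 0
    for arrivals in arrivals_per_tick:
        depth += arrivals
        peak = max(peak, depth)
        depth = max(0, depth - serviced_per_tick)
    return peak


# Both patterns average one packet per tick, matching the service rate.
smooth = peak_queue_depth([1] * 8, 1)
bursty = peak_queue_depth([4, 0, 0, 0, 4, 0, 0, 0], 1)
```

Both patterns have the same average arrival rate, yet the smooth pattern never queues more than one packet while the bursty pattern peaks at four. If the cache can hold fewer pushed entries than that peak, entries at the head of the queue are displaced before the driver reaches them.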
The cost of cache victimization due to cache push operations also comes into effect in certain scenarios. The cache unit treats a cache push operation similarly to a cache line replacement operation from the processor core. A cache push operation could result in a cache line being victimized, with the cache victim selected by the cache's particular replacement algorithm. If the location represented by the victimized cache line is of current or future interest to the code running on the processor 20, then the victimized cache line would subsequently be brought back into the cache from memory 40 as a result of an access by the processor 20. Thus, in this scenario, the push operation could result in a net increase rather than decrease in cache misses, thereby negatively impacting the overall system performance.
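The victimization cost can be illustrated with a toy model in which the processor repeatedly walks a working set that exactly fits in the cache while a device pushes unrelated lines between passes. The sketch assumes a fully associative cache with LRU replacement; the policy, the function name and the parameters are assumptions for illustration, since the source leaves the replacement algorithm unspecified.

```python
from collections import OrderedDict


def working_set_misses(capacity, working_set, pushes_per_pass, passes):
    """Count processor misses over repeated passes through a working set,
    with `pushes_per_pass` unrelated lines pushed after each pass."""
    cache = OrderedDict()

    def touch(addr):
        # Install or refresh a line under LRU; return True on a hit.
        hit = addr in cache
        cache[addr] = True
        cache.move_to_end(addr)
        if len(cache) > capacity:
            cache.popitem(last=False)
        return hit

    misses = 0
    next_push = 1_000_000  # pushed addresses kept distinct from the working set
    for _ in range(passes):
        for addr in range(working_set):
            if not touch(addr):
                misses += 1
        for _ in range(pushes_per_pass):
            touch(next_push)
            next_push += 1
    return misses
```

With a four-line cache, a four-line working set and three passes, the model gives 4 misses without pushes (the initial cold misses only) and 12 misses with two pushes per pass. Under LRU, each refill of a victimized working-set line evicts another line that is about to be needed, so two pushes cost four misses on every subsequent pass.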
Thus, in systems that implement cache push operations, the use of cache push operations could be counterproductive, especially under high packet rate and high core processor load conditions. Using cache push operations under both these conditions could result in additional cache misses and associated main memory accesses. As the processor continues to receive data on its I/O interfaces, more data gets pushed, with the push operation increasingly victimizing the current working set, resulting in more cache misses and more memory accesses. This cycle could continue, eventually bringing the system to a halt under heavy load and I/O, or causing packet loss due to increased processor stalls.