1. Field of the Invention
This invention generally relates to digital processing devices and, more particularly, to a system and method for caching control messages between peripheral devices and a processor.
2. Description of the Related Art
General purpose processor performance is measured simply as the time to execute a software program. The executed software program is made up of a finite set of instructions. The processor executes the software program using some number of clock cycles per instruction (CPI), where a cycle is based on a specific time interval called the cycle time. Multiplying the number of instructions to be executed by the CPI and by the cycle time gives the execution time of the program. This execution time is the measure of processor performance.
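The relationship described above can be written as a short calculation. The following is a minimal sketch; the function name and the sample figures are illustrative assumptions and do not describe any particular processor.

```python
def execution_time(instruction_count: int, cpi: float, cycle_time_s: float) -> float:
    """Return program execution time in seconds.

    execution time = instruction count x cycles per instruction x cycle time
    """
    return instruction_count * cpi * cycle_time_s


# Hypothetical workload: 1,000,000 instructions at 2 cycles each,
# on a clock with a 1 ns cycle time (1 GHz).
example_seconds = execution_time(1_000_000, 2.0, 1e-9)
```

Under these assumed figures the program takes about two milliseconds; halving either the CPI or the cycle time halves the execution time.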
The CPI of the processor is determined by a set of variables. A software program is made up of a combination of different instruction types including load/store instructions, data manipulation instructions, and comparison instructions. Each instruction type may require a different number of cycles to execute. Certain instructions, namely load and store operations, are dependent on outside factors, and the number of cycles to be performed is unknown. This unknown wait-time factor is referred to as the latency in satisfying the load or store operations.
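As an illustration of how load and store latency feeds into the CPI, the average CPI may be modeled as a weighted sum over the instruction mix. The sketch below is a simplified model; the mix fractions, the cycle counts, and the names `average_cpi` and `cpi_with_latency` are assumptions chosen only for the example.

```python
def average_cpi(mix: dict, cycles: dict) -> float:
    """Weighted average of cycles per instruction over an instruction mix.

    mix maps instruction type -> fraction of executed instructions (sums to 1).
    cycles maps instruction type -> cycles needed by that instruction type.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())


# Assumed mix: 30% load/store, 50% data manipulation, 20% comparison.
MIX = {"load_store": 0.3, "data": 0.5, "compare": 0.2}


def cpi_with_latency(load_store_cycles: float) -> float:
    """CPI when only the load/store cost varies with memory or device latency."""
    cycles = {"load_store": load_store_cycles, "data": 1.0, "compare": 1.0}
    return average_cpi(MIX, cycles)
```

In this model, raising the load/store cost from 1 cycle to 101 cycles moves the average CPI from 1 to 31, even though only 30% of the instructions are affected.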
A modern high performance CPU uses several techniques in order to reduce the number of cycles per instruction. These techniques attempt to exploit instruction level parallelism by executing non-dependent code sequences in parallel and out of order with respect to each other. This parallel execution is commonly referred to as superscalar execution. Another common technique is to permit load and store operations to complete to the memory system out of order with respect to each other. This technique is commonly referred to as a weakly ordered memory system. However, certain control aspects of computing require that load and store operations complete in the strict order that they were issued by the software code. This is especially true if the software being executed by the processor is communicating with a peripheral input/output (IO) device. Forcing the ordering of operations in an out-of-order processor with a weakly ordered memory system degrades performance: it increases the average CPI of the processor, thus lowering execution performance.
As an example, a software driver code may be required to set up a direct memory access (DMA) engine using a series of load and store operations to a set of registers. This set of load and store operations is referred to as Programmed IO (PIO). For such operations, strict completion ordering is required to make sure that the DMA engine is programmed correctly. These operations, therefore, can be thought of as being carried out in an atomic manner.
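The ordering requirement can be illustrated with a small software model. The sketch below is not a real driver: the register names (`SRC`, `DST`, `LEN`, `CTRL`), the start bit, and the class `FakeDmaEngine` are hypothetical, and the model simply assumes the engine samples its other registers at the moment the start bit is written.

```python
class FakeDmaEngine:
    """Software stand-in for a DMA engine's register block (hypothetical layout)."""

    START = 0x1  # assumed start bit in the CTRL register

    def __init__(self) -> None:
        self.regs = {"SRC": 0, "DST": 0, "LEN": 0, "CTRL": 0}
        self.transfers = []  # (src, dst, length) captured when START is written

    def store(self, name: str, value: int) -> None:
        """Programmed-IO store; takes effect immediately, in issue order."""
        self.regs[name] = value
        if name == "CTRL" and value & self.START:
            # The modeled engine samples the other registers when started.
            self.transfers.append(
                (self.regs["SRC"], self.regs["DST"], self.regs["LEN"])
            )

    def load(self, name: str) -> int:
        """Programmed-IO load of a register."""
        return self.regs[name]


def program_dma(dma: FakeDmaEngine, src: int, dst: int, length: int) -> None:
    """Correct PIO sequence: the start bit is written last."""
    dma.store("SRC", src)
    dma.store("DST", dst)
    dma.store("LEN", length)
    dma.store("CTRL", FakeDmaEngine.START)


def program_dma_reordered(dma: FakeDmaEngine, src: int, dst: int, length: int) -> None:
    """Same stores, but with the start bit completing first (ordering not enforced)."""
    dma.store("CTRL", FakeDmaEngine.START)
    dma.store("SRC", src)
    dma.store("DST", dst)
    dma.store("LEN", length)
```

With the correct sequence the modeled engine starts a transfer with the programmed addresses and length. With the reordered sequence it starts with whatever the registers held beforehand, which is why strict completion ordering is required for such register programming.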
If the software program is dominated by a high ratio of PIO to computational code, then the overall performance is impacted by how efficiently the PIO is carried out. As stated earlier, such PIO operations are usually dominated by the latency in accessing the remote device. As processor frequency increases, this latency, measured in processor cycles, increases linearly. If nothing is done to reduce this latency then the overall performance scaling suffers. Therefore, new techniques must be deployed in order to reduce the dependency on PIO for the overall performance of the processor.
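The scaling effect can be shown with simple arithmetic. The sketch below assumes that the time to reach the remote device is fixed in nanoseconds and does not shrink as the processor clock speeds up; the function name and the 200 ns figure are illustrative assumptions.

```python
def stall_cycles(device_latency_ns: float, cpu_frequency_ghz: float) -> float:
    """Processor cycles spent waiting on one PIO access.

    One GHz is one cycle per nanosecond, so cycles = nanoseconds x GHz.
    """
    return device_latency_ns * cpu_frequency_ghz


# Assumed 200 ns device access at two hypothetical clock rates.
cycles_at_1ghz = stall_cycles(200.0, 1.0)
cycles_at_2ghz = stall_cycles(200.0, 2.0)
```

Under this assumption, doubling the clock rate doubles the number of cycles the processor loses on each PIO access.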
Another aspect impacting performance is the communication of events from the peripheral IO device to the software. This communication is typically done using either an interrupt or polling mechanism. In the case of an interrupt, the processor suspends the current code execution and proceeds to execute the interrupt service routine (ISR). The ISR usually requires several load operations to capture status information about the event from the IO device, followed by some store operations to reset the status of the peripheral.
The modern microprocessor makes use of a hierarchy of one or more caches to help reduce the load/store latency impact to performance for code or data structures that are often accessed. Processor caches were devised to reduce the average access latency for software memory references, as applied to the Harvard Architecture based processor.
A cache is a temporary collection of digital data duplicating original values stored elsewhere. Typically, the original data is expensive to fetch, due to a slow memory access time, or to compute, relative to the cost of reading the cache. Thus, a cache is a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in the cache, the cached copy can be quickly accessed, rather than re-fetching or recomputing the original data, so that the average access time is lower.
Caches have proven to be extremely effective in many areas of computing because access patterns in typical computer applications have locality of reference. A CPU and hard drive frequently use a cache, as do web browsers and web servers.
FIG. 1 is a diagram of a cache memory associated with a CPU (prior art). A cache is made up of a pool of entries. Each entry has a datum or segment of data which is a copy of a segment in the backing store. Each entry also has a tag, which specifies the identity of the segment in the backing store of which the entry is a copy.
When the cache client, such as a CPU, web browser, or operating system wishes to access a data segment in the backing store, it first checks the cache. If an entry can be found with a tag matching that of the desired segment, the segment in cache is accessed instead. This situation is known as a cache hit. So for example, a network routing program might need to look up a route entry in a table at a particular address in memory. The hardware first checks the cache tag to see if a copy of the entry is already resident. If so, then the request is serviced directly from the segment pointed to by the tag and a longer memory access latency is avoided. Alternately, when the cache is consulted and found not to contain a segment with the desired tag, a cache miss results. The segment fetched from the backing store during miss handling is usually inserted into the cache, ready for the next access.
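The hit and miss handling described above can be sketched as follows. This is a simplified software model with unbounded capacity, not a hardware design; the class name `SimpleCache` and the use of a dictionary as the backing store are assumptions made for illustration.

```python
class SimpleCache:
    """Tag-indexed pool of entries in front of a slower backing store."""

    def __init__(self, backing_store: dict) -> None:
        self.backing_store = backing_store
        self.entries = {}  # tag -> cached copy of the segment
        self.hits = 0
        self.misses = 0

    def read(self, tag):
        """Return the segment for tag, serving it from the cache when possible."""
        if tag in self.entries:
            # Cache hit: the longer backing-store access is avoided.
            self.hits += 1
            return self.entries[tag]
        # Cache miss: fetch from the backing store and keep a copy
        # so that the next access to the same tag is a hit.
        self.misses += 1
        segment = self.backing_store[tag]
        self.entries[tag] = segment
        return segment
```

For example, a first read of a route entry at a given address misses and is fetched from the backing store, while a second read of the same address is serviced directly from the cache.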
When a data segment is written into cache, it is typically, at some point, written to the backing store as well. The timing of this write is controlled by what is known as the write policy. In a write-through cache, every write to the cache causes a write to the backing store. Alternatively, in a write-back cache, writes are not immediately mirrored to the store. Instead, the cache tracks which of its locations (cache lines) have been written over. The segments in these “dirty” cache lines are written back to the backing store when those data segments are replaced with a new segment. For this reason, a miss in a write-back cache will often require two memory accesses to service: one to retrieve the needed segment, and one to write replaced data from the cache to the store.
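The two write policies can be contrasted in a small model. The sketch below is illustrative only: the class name `WritePolicyCache`, the oldest-first eviction, and the access counters are assumptions, and a real cache controller is considerably more involved.

```python
class WritePolicyCache:
    """Tiny cache model supporting write-through or write-back behavior."""

    def __init__(self, backing_store: dict, capacity: int, write_back: bool) -> None:
        self.store = backing_store
        self.capacity = capacity
        self.write_back = write_back
        self.lines = {}       # tag -> data, kept in insertion order
        self.dirty = set()    # tags written in cache but not yet in the store
        self.store_reads = 0
        self.store_writes = 0

    def _evict_if_full(self) -> None:
        if len(self.lines) < self.capacity:
            return
        victim = next(iter(self.lines))  # oldest line, for simplicity
        data = self.lines.pop(victim)
        if victim in self.dirty:
            # Write-back: a dirty line reaches the store only when replaced.
            self.dirty.discard(victim)
            self.store[victim] = data
            self.store_writes += 1

    def write(self, tag, data) -> None:
        if tag not in self.lines:
            self._evict_if_full()
        self.lines[tag] = data
        if self.write_back:
            self.dirty.add(tag)
        else:
            # Write-through: every cache write is mirrored to the store.
            self.store[tag] = data
            self.store_writes += 1

    def read(self, tag):
        if tag not in self.lines:
            self._evict_if_full()
            self.lines[tag] = self.store[tag]
            self.store_reads += 1
        return self.lines[tag]
```

In the write-back configuration with a single line, writing one tag and then reading a different tag produces two backing-store accesses on the miss: one write of the replaced dirty line and one read of the requested segment.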
The data in the backing store may be changed by entities other than the cache, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the client updates the data in the cache, copies of that data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as coherency protocols. CPU caches are generally managed entirely by hardware.
In contrast to a cache, a buffer is a temporary storage location where a large block of data is assembled or disassembled. This large block of data may be necessary for interacting with a storage device that requires large blocks of data, or when data must be delivered in a different order than that in which it is produced, or when the delivery of small blocks is inefficient. The benefit is present even if the buffered data is written to the buffer only once and read from the buffer only once. A cache, on the other hand, is useful in situations where data is read from the cache more often than it is written there. The purpose of a cache is to reduce accesses to the underlying storage.
As noted above, caching structures are often used in computer systems dealing with persistent data. The processor loads the data into the cache at the start of, and during, processing. Access latencies are improved during processing as the cache provides a store to hold the data structures closer to the processor than the main memory. The conventional cache line replacement algorithms select segments based upon the order in which elements were loaded or accessed within the cache. However, these replacement algorithms are not necessarily efficient for transient data. Conventionally, transient data is located within the main (off-chip) data store and/or within on-chip buffers or queues. The management of these on-chip resources can be complicated by the sizing of on-chip storage. It is difficult to determine and map the different addresses required between the on-chip and off-chip stores.
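One way to see why access-order replacement can be inefficient for transient data is a least-recently-used (LRU) model. The sketch below is an assumption-laden illustration: LRU is used as one example of an access-order policy, and the class name `LruCache` and the tag names in the usage note are hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Fixed-capacity cache that replaces the least recently used line."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag, load):
        """Return (data, evicted_tag); load(tag) fetches data on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return self.lines[tag], None
        evicted = None
        if len(self.lines) >= self.capacity:
            evicted, _ = self.lines.popitem(last=False)  # drop the LRU line
        self.lines[tag] = load(tag)
        return self.lines[tag], evicted
```

With a capacity of two lines, accessing a persistent structure such as "table" and then two single-use packet buffers, "pkt0" and "pkt1", evicts "table" even though the packet buffers will never be read again.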
Allocation of data into the cache is normally done based on a load or store reference by software, executing on a computer processor unit (CPU), to a specific address region that is marked as “cacheable”. Whenever a cacheable address is referenced, a cache controller first looks up the address in the cache tag. If the address is not currently in cache, then the cache controller permits the memory access to continue to the next level of the memory system, to fetch the required data (cache line). At some later point, the data is loaded into the cache along with completing and satisfying the original software request for a portion or all of the data. A typical processor implementation allocates data into a cache by reading or writing a data element to/from memory that is marked as cacheable. The memory subsystem brings a copy of the memory into the cache as it is being delivered to the processor.
However, the above-described caching scheme is inefficient for embedded communications processing, as time is wasted waiting for transient data to be loaded into the cache. For example, in packet processing, an ingress packet is first written to a data buffer in main memory. Subsequently, the software being executed by the processor is alerted by an Ethernet DMA engine that a packet has been posted, usually by means of an interrupt. The processor takes the exception and software reads some status and control information in the Ethernet controller to determine the reason for the interrupt. Next, the executing software begins reading the packet header to perform packet classification. All of these reads are high in latency and serialize the packet processing time.
To combat the inefficiencies in the cache replacement of transient data, a cache “stashing” technique may be employed that prevents elements in cache from being replaced in accordance with an LRU replacement policy until “unlocked” by an external processor. Cache stashing is a technique where another processing element (such as a DMA engine) allocates a cache line into a cache that belongs to another processor, on behalf of that processor, based upon the assumption that the processor will use the data at a later time. Rather than waiting for the executing software to “touch” a particular address in order to allocate it into a cache, the cache controller is modified to allow DMA agents to allocate data into the cache. This allocation means that when a DMA agent is writing data to memory, it marks the transaction as “stash-able.” The stash-able marking indicates to the cache controller that the data elements can be put into the cache while the memory system is pushing the data to main memory. Later on, when software goes to access the packet data, the packet data is already present in the cache, thus eliminating some of the latency that would have otherwise occurred in fetching the data all the way from main memory.
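The effect of the stash-able marking can be sketched in a simple model. The following is illustrative only and does not model locking or replacement: the class name `StashingCache`, the `stashable` flag, and the addresses in the usage note are assumptions, and main memory is represented by a dictionary.

```python
class StashingCache:
    """Processor cache model into which a DMA agent may allocate lines."""

    def __init__(self, main_memory: dict) -> None:
        self.memory = main_memory
        self.lines = {}  # address -> cached data
        self.hits = 0
        self.misses = 0

    def dma_write(self, address, data, stashable: bool = False) -> None:
        """DMA agent writes data to main memory.

        A write marked stash-able also places the data in the cache while
        it is being pushed to main memory.
        """
        self.memory[address] = data
        if stashable:
            self.lines[address] = data
        else:
            self.lines.pop(address, None)  # discard any stale cached copy

    def cpu_read(self, address):
        """Software load; allocates the line on a miss."""
        if address in self.lines:
            self.hits += 1
            return self.lines[address]
        self.misses += 1
        data = self.memory[address]
        self.lines[address] = data
        return data
```

In this model, a packet header written without the stash-able marking causes a miss on the first software read, whereas a header written with the marking is already present in the cache and is read as a hit.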
By moving a copy of packet data closer to the processor temporally, the access penalty can be reduced. While the concept has been applied to the generic movement of data from DMA agents to main memory, the technique is not directly applicable to control and status registers, which must always reflect the current state of the remote peripheral. Therefore, PIO can rarely leverage the advantage of the cache hierarchy.
For example, a peripheral may have many control and status registers associated with it. Conventionally, executing software must perform loads and stores atomically to these registers in order to access and/or control the peripheral. The latency and overhead of these operations are becoming an increasingly significant limit on performance scaling.
It would be advantageous if control and status register information could be allocated to cache in a manner similar to the way raw data structures are allocated.