1. Field of the Invention
This invention relates in general to the field of microelectronics, and more particularly to an apparatus and method that enable a programmer to direct a microprocessor to perform write back and invalidate operations to flush a specified number of cache lines from its internal cache.
2. Description of the Related Art
In a present day microprocessor, the speed at which data can be transferred between internal logic blocks is an order of magnitude faster than the speed of external memory accesses. In an x86 desktop computer configuration, the bus which interfaces to system memory operates on the order of hundreds of megahertz while internal microprocessor clock speeds are approaching the tens of gigahertz. Consequently, a hierarchy of cache structures has evolved in recent years to allow high-performance microprocessors to run at speed without having to execute transactions over a slow memory bus every time data is to be read or written.
An on-board, or local, cache in a pipeline microprocessor is essentially a stand-alone unit that operates in a manner that is transparent to the flow of instructions in the microprocessor pipeline. Its purpose is to ensure that data which is going to be required by instructions in an application program is resident within the cache and can hence be accessed at pipeline speeds rather than at memory bus speeds. Cache architectures are varied in the art, some consisting of multiple levels of cache: a first-level cache very near to a processor's execution logic, a second-level cache which may be on-chip or off-chip where less frequently accessed data are stored, third-level caches which may be on a memory card itself, and so on. Regardless of which architecture is employed, one skilled in the art will appreciate that the overall objective of a cache is to preclude the case where instructions in a microprocessor pipeline are stalled while a bus transaction over a slow memory bus is issued in order to obtain data that is needed by a pending read or write operation. When such a stall occurs, program execution comes to a grinding halt until the data arrives.
To complicate matters, present day computer systems also share regions of memory between devices. For instance, a primary microprocessor communicates with a communications microprocessor by reading data from and writing data to a specified region of memory. It is also very common to find areas of memory known as video buffers that are shared between a primary microprocessor and a microprocessor on a video card that is responsible for displaying data to an operator on a video device.
In a shared memory system, the possibility exists that data from a shared region can be in the local caches of two different microprocessors or other devices connected to a memory bus. As long as all of the devices do nothing more than read the data, no harm is done by allowing each of them to have the data in its local cache structure. But unpredictable consequences could result if more than one device were allowed to modify the data that is locally cached.
To prevent this from happening, system designers have developed cache consistency protocols that indicate the status of data in a cache. The most prevalent protocol is known as MESI. Maintaining a local cache according to MESI ensures that two copies of the same data cannot be modified at the same time. A MESI shared state indicates to a local cache whether or not a particular block of data is shared. If it is shared, then the local processor cannot modify it without first conducting transactions over the slower memory bus to obtain exclusive permission to modify the data. To modify data, a processor must first obtain exclusive ownership of the data.
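By way of illustration only, the ownership rule described above can be sketched in C. The sketch assumes the conventional expansion of MESI into four line states (modified, exclusive, shared, invalid). The names mesi_state, write_needs_bus, and after_local_write are hypothetical and are not drawn from any particular microprocessor or library.

```c
#include <assert.h>
#include <stdbool.h>

/* The four states a cache line may occupy under MESI. */
typedef enum {
    MESI_INVALID,
    MESI_SHARED,
    MESI_EXCLUSIVE,
    MESI_MODIFIED
} mesi_state;

/* Returns true when a local write to a line in state s must first go to
 * the memory bus to obtain exclusive ownership. Exclusive and modified
 * lines are already owned locally, so no bus transaction is needed. */
bool write_needs_bus(mesi_state s) {
    assert(s >= MESI_INVALID && s <= MESI_MODIFIED);
    return s == MESI_SHARED || s == MESI_INVALID;
}

/* State of the line once a local write has completed, that is, after any
 * required ownership transaction has finished. */
mesi_state after_local_write(mesi_state s) {
    assert(s >= MESI_INVALID && s <= MESI_MODIFIED);
    return MESI_MODIFIED;
}
```

In this sketch a write to a shared line is the slow case: write_needs_bus reports that the bus must be consulted before the line can move to the modified state.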
The problem addressed by the invention described herein relates to the delays that are incurred in a program where it is desired to write back data that has been modified in a local cache to memory. One skilled in the art will appreciate that a cache has no reasonable way of knowing when a particular area of memory will first be required, and so memory bus delays are always incurred when the particular area of memory is first loaded into local cache. Recognizing this fact, designers have developed a prefetch instruction that can be programmed for execution by a microprocessor. The prefetch instruction, however, does not operate on operands in the program flow. Rather, the prefetch instruction directs a local cache to load operands from memory into the cache that will be used in the future. And since cache units operate with memory in a manner that is transparent to instruction flow in a microprocessor pipeline, it is prudent to issue a prefetch instruction prior to needing data so that the cache can fetch the data from memory prior to the time it is needed, in parallel with the execution of other instructions in the primary program flow. Then when a subsequent instruction comes along that needs access to the prefetched data, it is readily accessible from cache and program execution is not stalled while waiting for it to be fetched from memory. Prefetched data is readily accessible in cache, that is, so long as it is going to be read. If it is to be modified by a subsequent instruction, then the program execution may have to be delayed while a cache unit goes to the bus to request exclusive ownership of shared data so that it can be modified by the local processor.
As alluded to above, shared memory regions are widely employed in present day computing systems to allow for communication of information between devices. In the case of a communication buffer between devices connected to a bus, the presence or absence of certain data in a data area (i.e., the designated area is set to one logical value, say all zeros or all ones) may indicate to one device that another device is ready to receive data. Alternatively, in a video buffer configuration, a microprocessor may start “painting” a background color for display to an operator by setting all of the video buffer entries to the same value. Both of the above examples require that particular data be stored to data entities within the data area in order to transfer information from one device to another. But once the data has been modified in cache, many applications require that the data be written back, or flushed, from cache to memory so that the other devices can access the data. Alternatively, once a microprocessor has finished modifying a large chunk of data within its local cache, say in the case of generating a portion of an image to display to an operator via a video buffer interface, then it is very advantageous in terms of cache management to flush the modified data from the cache out to the memory bus. Once the data has been written back to memory, the area that it formerly occupied within the local cache becomes available for caching of new data.
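As a minimal sketch of the prefetch usage described above, the following C function issues a read prefetch a fixed distance ahead of the element it is currently consuming. It assumes a GCC- or Clang-compatible compiler, which provides the __builtin_prefetch built-in. The function name sum_with_prefetch and the value of PREFETCH_DISTANCE are hypothetical choices made for illustration, not tuned figures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical look-ahead distance, in elements; an assumption, not a
 * measured or recommended value. */
#define PREFETCH_DISTANCE 16

/* Sums n elements while hinting to the cache that data a fixed distance
 * ahead will soon be read. The prefetch does not operate on program
 * operands; it only asks the cache to begin loading the line. */
long sum_with_prefetch(const long *data, size_t n) {
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 requests a prefetch for reading;
             * third argument 3 requests high temporal locality. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch here is a hint for reading only. Consistent with the discussion above, a later store to the same data may still be delayed while exclusive ownership is obtained.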
Most instruction set architectures provide instructions that allow a programmer to direct a conforming microprocessor to flush a single cache line. For example, the x86 instruction CLFLUSH directs an x86-compatible microprocessor to write back and invalidate contents of a single cache line. Alternatively, some instruction sets provide an instruction that enables a programmer to direct a conforming microprocessor to flush all of the modified lines in a cache to memory. The x86 instruction WBINVD directs an x86-compatible microprocessor to write back and invalidate all modified cache lines within a local data cache. Like the prefetch instruction, these two types of write back and invalidate instructions operate in a manner that is transparent to the flow of instructions through the execution stages of a microprocessor pipeline in that they are provided to cache instead of computation logic. Yet, cache units have a finite capability to post write operations to a memory bus. Once their write buffer is full of pending writes to the bus, then subsequent write operations must be stalled until space becomes available in the write buffer.
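For illustration, a single-line flush of the kind described above may be requested from C through the _mm_clflush intrinsic, which compilers targeting x86 with SSE2 support expose in the emmintrin.h header. The sketch below assumes such a compiler and falls back to doing nothing elsewhere. The helper names flush_one_line and flush_preserves_value are hypothetical. WBINVD is not shown because it is a privileged instruction that ordinary application code cannot execute.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#endif

/* Hypothetical helper: asks the processor to write back and invalidate
 * the single cache line containing p. Returns 1 if a flush instruction
 * was issued, 0 if this build has no SSE2 support and nothing was done. */
int flush_one_line(const void *p) {
    assert(p != NULL);
#if defined(__SSE2__)
    _mm_clflush(p);   /* CLFLUSH on the line containing p */
    _mm_mfence();     /* order the flush with surrounding memory accesses */
    return 1;
#else
    (void)p;
    return 0;
#endif
}

/* Hypothetical usage example: a flush changes where the data resides,
 * not its value, so the value read back equals the value stored. */
int flush_preserves_value(int v) {
    volatile int slot = v;
    flush_one_line((const void *)&slot);
    return slot;
}
```

Each call affects exactly one cache line, which is the limitation that the following discussion turns on.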
Many operations performed by an application program are not restricted to data within a single cache line, nor are they so complex that their data fills an entire cache structure. Instead, many operations modify data structures that comprise multiple cache lines that occupy a significant portion of a local cache, but not the entire local cache. And to flush the data structures to memory, a programmer must employ a number of single-line flush instructions that is equivalent to the size of the portion of local cache that the data structures occupy. This could conceivably result in thousands of single-line flush instructions in a program flow. And because the data structures only occupy a significant portion of the cache, to flush all of the modified lines in the cache would be an inefficient use of memory bus bandwidth. Hence, instructions like WBINVD are virtually never employed because of their devastating effects on program execution speed. To execute WBINVD would result in stalling subsequent pending store operations for an intolerable period of time.
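The count of single-line flush instructions mentioned above can be made concrete with a short calculation. The C sketch below computes how many cache lines a data structure spans, and therefore how many single-line flushes would be required. The function name lines_spanned is hypothetical, and the 64-byte line size used in the example that follows is an assumption rather than a property stated herein.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the number of cache lines touched by the byte range
 * [addr, addr + len). line_size must be a nonzero power of two. */
size_t lines_spanned(uintptr_t addr, size_t len, size_t line_size) {
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (len == 0) {
        return 0;
    }
    /* Round the first and last byte addresses down to line boundaries. */
    uintptr_t mask  = ~(uintptr_t)(line_size - 1);
    uintptr_t first = addr & mask;
    uintptr_t last  = (addr + len - 1) & mask;
    return (size_t)((last - first) / line_size) + 1;
}
```

Under the assumed 64-byte line size, a line-aligned 256 KB structure yields lines_spanned(0, 262144, 64) equal to 4096, which is to say 4096 separate single-line flush instructions.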
Therefore, what is needed is an apparatus and method that enable a programmer to direct a microprocessor to write back and invalidate multiple cache lines via execution of a single instruction.