1. Field of the Invention
This invention relates in general to the field of microelectronics, and more particularly to an apparatus and method that enable a programmer to direct a microprocessor to perform a prefetch operation that allocates and initializes a block of cache lines in its internal cache.
2. Description of the Related Art
In a present day microprocessor, the speed at which data can be transferred between internal logic blocks is an order of magnitude faster than the speed that external memory accesses exhibit. In an x86 desktop computer configuration, the bus which interfaces to system memory operates on the order of hundreds of megahertz while internal microprocessor clock speeds are approaching the tens of gigahertz. Consequently, a hierarchy of cache structures has evolved in recent years to allow high-performance microprocessors to run at speed without having to execute transactions over a slow memory bus every time data is to be read or written.
An on-board, or local, cache in a pipeline microprocessor is a stand-alone unit that operates in a manner transparent to the flow of instructions in the microprocessor pipeline. Its purpose is to ensure that data which is going to be required by instructions in an application program is resident within the cache and can hence be accessed at pipeline speeds rather than at memory bus speeds. Cache architectures are varied in the art, some consisting of multiple levels of cache: a first-level cache very near to a processor's execution logic, a second-level cache, which may be on-chip or off-chip, where less frequently accessed data are stored, third-level caches which may be on a memory card itself, and so on. Regardless of which architecture is employed, one skilled in the art will appreciate that the overall objective of a cache is to preclude the case where instructions in a microprocessor pipeline are stalled while a bus transaction over a slow memory bus is issued in order to obtain data that is needed by a pending read or write operation. When this happens, program execution comes to a grinding halt until the data arrives.
To complicate matters, present day computer systems also share regions of memory between devices. For instance, a primary microprocessor communicates with a communications microprocessor by reading data from and writing data to a specified region of memory. It is also very common to find areas of memory known as video buffers that are shared between a primary microprocessor and a microprocessor on a video card that is responsible for displaying data to an operator on a video device.
In a shared memory system, the possibility exists that data from a shared region can be in the local caches of two different microprocessors or other devices connected to a memory bus. As long as all of the devices do nothing more than read the data, no harm is done by allowing each of them to have the data in its local cache structure. But unpredictable consequences could result if they were both allowed to modify the data that is locally cached.
To prevent this from happening, system designers have developed cache consistency protocols that indicate the status of data in a cache. The most prevalent protocol is known as Modified, Exclusive, Shared, Invalid (“MESI”). Maintaining a local cache according to MESI ensures that two copies of the same data cannot be modified at the same time. A MESI shared state indicates to a local cache whether or not a particular block of data is shared. If it is shared, then the local processor cannot modify it without first conducting transactions over the slower memory bus to obtain exclusive permission to modify the data. To modify data, a processor must first obtain exclusive ownership of the data.
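The write rule described above can be sketched as a small state table. This is a simplified illustration under stated assumptions, not a description of any particular processor: the names `State` and `local_write` are hypothetical, and the model covers only what happens to a line in one local cache when the local processor writes to it.

```python
from enum import Enum


class State(Enum):
    """The four MESI states of a cache line in one local cache."""
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def local_write(state):
    """Return (new_state, bus_transaction_needed) for a local write.

    Simplified model: a line that is already owned (Modified or
    Exclusive) can be written at once. A Shared or Invalid line
    requires a bus transaction to obtain exclusive ownership first.
    """
    if state in (State.MODIFIED, State.EXCLUSIVE):
        return State.MODIFIED, False
    return State.MODIFIED, True


# Show the outcome of a local write for each starting state.
for s in State:
    new_state, needs_bus = local_write(s)
    print(f"{s.name:9} -> {new_state.name}, bus transaction needed: {needs_bus}")
```

In this sketch only the Shared and Invalid cases report that a bus transaction is needed, which is the delay the following paragraphs are concerned with.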
The problem addressed by the invention described herein relates to the delay that is incurred in a program where it is desired to write data to memory. One skilled in the art will appreciate that a cache has no reasonable way of knowing when a particular area of memory will first be required, and so memory bus delays are always incurred when the particular area of memory is first loaded into local cache. Recognizing this fact, designers have developed a prefetch instruction that can be programmed for execution by a microprocessor. The prefetch instruction, however, does not operate on operands in the program flow. Rather, the prefetch instruction directs a local cache to load from memory operands that will be used in the future. And since cache units operate with memory in a manner that is transparent to instruction flow in a microprocessor pipeline, it is prudent to issue a prefetch instruction prior to needing data so that the cache can fetch the data from memory before it is needed, in parallel with the execution of other instructions in the primary program flow. Then when a subsequent instruction comes along that needs access to the prefetched data, it is readily accessible from cache and program execution is not stalled while waiting for it to be fetched from memory. Prefetched data is readily accessible in cache, that is, so long as it is only going to be read. If it is to be modified by a subsequent instruction, then program execution may have to be delayed while a cache unit goes to the bus to request exclusive ownership of shared data so that it can be modified by the local processor.
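The difference between reading and writing prefetched data can be illustrated with a toy model. The following is a minimal sketch under stated assumptions: the class `ToyCache`, its methods, and the address used are hypothetical, each line is tracked by a MESI letter, and a prefetch is assumed to complete in the background without stalling the program.

```python
class ToyCache:
    """Toy model of one local cache; line states are MESI letters."""

    def __init__(self, shared_by_other_device):
        # Assumption: if another device also caches the region, a
        # fetched line arrives in the Shared state, else Exclusive.
        self.shared_by_other_device = shared_by_other_device
        self.lines = {}            # line address -> "M", "E", or "S"
        self.bus_transactions = 0  # every access to the memory bus
        self.stalls = 0            # bus accesses the program waits on

    def _fetch(self, addr):
        self.bus_transactions += 1
        self.lines[addr] = "S" if self.shared_by_other_device else "E"

    def prefetch(self, addr):
        # Runs in parallel with program flow: bus access, no stall.
        if addr not in self.lines:
            self._fetch(addr)

    def read(self, addr):
        # A read stalls only if the line is not yet in the cache.
        if addr not in self.lines:
            self._fetch(addr)
            self.stalls += 1

    def write(self, addr):
        # A write stalls unless the line is already owned (M or E).
        if self.lines.get(addr) not in ("M", "E"):
            self.bus_transactions += 1
            self.stalls += 1
        self.lines[addr] = "M"


cache = ToyCache(shared_by_other_device=True)
cache.prefetch(0x1000)
cache.read(0x1000)
print("stalls after prefetch and read:", cache.stalls)
cache.write(0x1000)
print("stalls after write to shared line:", cache.stalls)
```

In this model the read after the prefetch incurs no stall, while the write to the same line incurs one because the line was prefetched in the Shared state.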
As alluded to above, shared memory regions are widely employed in present day computing systems to allow for communication of information between devices. In the case of a communication buffer between devices connected to a bus, the presence or absence of certain data in a data area (i.e., the designated area is set to one logical value, say all zeros or all ones) may indicate to one device that another device is ready to receive data. Alternatively, in a video buffer configuration, a microprocessor may start “painting” a background color for display to an operator by setting all of the video buffer entries to the same value. Both of the above examples require that particular data be stored to data entities within the data area in order to transfer information from one device to another. And neither of these scenarios requires that the data from the data area be read prior to transferring the information. Consequently, devices within a shared memory computing system that transfer data to a shared memory region have no efficient means of progressing through cache to the memory bus, even if the shared memory region has been previously loaded into cache through a prefetch operation, because the data in cache may not be exclusively owned.
Therefore, what is needed is an apparatus and method that enable a programmer to direct a microprocessor to prefetch data in a state where it can be modified by a subsequent operation without having to resort to bus transactions to obtain exclusive ownership of the data.
In addition, what is needed is an apparatus and method that enable a programmer to direct a microprocessor to initialize data that has been exclusively prefetched.