1. Field of the Invention
This invention relates in general to the field of microelectronics, and more particularly to an apparatus and method that enable a programmer to direct a microprocessor to rename a block of cache lines within its internal cache.
2. Description of the Related Art
In a present day microprocessor, the speed at which data can be transferred between internal logic blocks is an order of magnitude faster than the speed that external memory accesses exhibit. For example, in an x86 desktop computer configuration, the bus which interfaces to system memory operates at speeds on the order of hundreds of megahertz while internal microprocessor clock speeds are approaching the tens of gigahertz. Consequently, a hierarchy of cache structures has evolved in recent years to allow high-performance microprocessors to run at speed without having to stall while transactions are executed over a slow memory bus every time data is to be read or written.
An on-board, or local, cache in a pipeline microprocessor is a stand-alone unit by design that operates in a manner transparent to the flow of instructions in the microprocessor pipeline to ensure that data which is going to be required by instructions in an application program is resident within the cache and can hence be accessed at pipeline speeds rather than at memory bus speeds. Cache architectures vary in the art, some consisting of multiple hierarchical levels of cache: a first-level cache very near to a processor's execution logic, a second-level cache which may be on-chip or off-chip where less frequently accessed data are stored, third-level caches which may be on a memory card itself, and so on. Regardless of which architecture is employed, however, one skilled in the art will appreciate that the overall objective of a cache is to preclude situations where instructions in a microprocessor pipeline are stalled while waiting for bus transactions over a slow memory bus to complete in order to obtain data that is needed by a pending read or write operation. When this happens, program execution comes to a grinding halt until the data arrives.
To complicate matters, present day computer systems also share regions of memory between devices. For instance, a primary microprocessor communicates with a communications microprocessor by reading data from and writing data to a specified region of memory. It is also very common to find areas of memory known as video buffers that are shared between a primary microprocessor and a microprocessor on a video card that is responsible for displaying data to an operator on a video device.
In a shared memory system, the possibility exists that data from a shared region can exist concurrently in the local caches of two different microprocessors or other devices connected to a memory bus. As long as all of the devices do nothing other than read the data, no harm is done by allowing each of them to have the data in its local cache structure. But unpredictable consequences could result if more than one device were allowed to modify the data that is locally cached.
To prevent this from happening, system designers have developed cache consistency protocols that indicate the status of data in a cache. The most prevalent protocol is known as MESI, named for its four cache line states: modified, exclusive, shared, and invalid. Maintaining a local cache according to MESI ensures that two copies of the same data cannot be modified at the same time. A MESI shared state indicates to a local cache whether or not a particular block of data is shared. If it is shared, then the local processor cannot modify the block of data without first conducting transactions over the slower memory bus to obtain exclusive permission to modify the data. To modify data, a processor must first obtain exclusive ownership of the data.
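The ownership rule described above can be illustrated with a minimal sketch in C. It is a simplified model rather than the logic of any particular cache unit, and the names mesi_state, write_needs_bus, and write_line are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* The four MESI states of a single cache line. */
typedef enum {
    MESI_INVALID,
    MESI_SHARED,
    MESI_EXCLUSIVE,
    MESI_MODIFIED
} mesi_state;

/* Hypothetical helper: returns true when a local write cannot proceed
 * until a bus transaction has granted exclusive ownership of the line. */
bool write_needs_bus(mesi_state s)
{
    return s == MESI_INVALID || s == MESI_SHARED;
}

/* Hypothetical helper: state of the line after a local write.
 * The caller must already hold exclusive ownership. */
mesi_state write_line(mesi_state s)
{
    assert(!write_needs_bus(s)); /* ownership must be obtained first */
    return MESI_MODIFIED;
}
```

In this model, a line in the shared or invalid state forces a trip to the memory bus before a write, whereas a line in the exclusive or modified state can be written at pipeline speed.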
The problem addressed by the invention described herein concerns the delay that is incurred in an application program when data is written to memory. One skilled in the art will appreciate that a cache has no reasonable way of knowing when a particular area of memory will first be required, and so memory bus delays are generally experienced when the particular area of memory is first loaded into local cache. Recognizing this fact, designers have developed a prefetch instruction that can be programmed into an application program for execution by a microprocessor. The prefetch instruction, however, does not operate on operands in the program flow. Rather, the prefetch instruction directs a local cache to load from memory those operands that will be required in the future by subsequent instructions. And since cache units interact with memory in a manner that is transparent to instruction flow in a microprocessor pipeline, it is prudent to issue a prefetch instruction before the data is needed so that the cache can fetch the data from memory in parallel with the execution of other instructions in the primary program flow. Then when a subsequent instruction is executed that requires access to the prefetched data, the data is readily accessible from cache and program execution is not stalled while waiting for it to be fetched from memory. Prefetched data is readily accessible in cache, that is, so long as it is only going to be read. If the data is to be modified by a subsequent instruction, then program execution may have to be delayed while a cache unit goes to the bus to request exclusive ownership of the data.
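By way of illustration only, a prefetch issued ahead of use might be sketched in C as follows. The sketch assumes a GCC or Clang compiler, whose __builtin_prefetch builtin emits a prefetch hint where the target supports one. The function name sum_with_prefetch and the value of PREFETCH_DISTANCE are hypothetical and chosen only for this example.

```c
#include <stddef.h>

/* Hypothetical tuning value: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sums an array while hinting the cache to load upcoming elements
 * before the loop reaches them. */
long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0: the data is prefetched for reading.
             * Third argument 3: high temporal locality. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i]; /* ideally served from cache by now */
    }
    return total;
}
```

The prefetch in this sketch is a read hint. Consistent with the discussion above, it does not by itself secure exclusive ownership, so a later write to the same line may still have to wait on the bus.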
As alluded to above, shared memory regions are widely employed in present day computing systems to allow for communication of information between devices. One very common operation that is encountered in virtually all present day applications is known as a memory copy. A memory copy operation copies the contents of one memory location to another memory location. For example, a video buffer that is to be transferred for display is often built up from a number of smaller buffers. Contents of a first smaller buffer are configured for display and then copied to a second memory location that is, say, one of four screen quadrants. After the screen quadrant is configured, contents of its memory locations are copied to third memory locations where the entire screen contents reside. One skilled in the art will appreciate that generation of video buffer data is only one example of the many different tasks that application programs perform where data is copied from one location to the next.
Memory copy operations appear to be straightforward, but in view of the employment of hierarchical cache architectures within today's microprocessors, such operations actually become quite problematic. Consider that a first data structure, SRC, has been allocated and modified in a present day cache. The operations necessary to complete a memory copy operation are then 1) to allocate and secure exclusive ownership of a second data structure, DEST, to which the contents of SRC are to be copied, and 2) to modify the contents of DEST so that its contents match the contents of SRC. Yet, as has been described above, if DEST has not been previously allocated in cache, or if DEST has been allocated but its contents are shared, then securing exclusive ownership of DEST requires that the application program be stalled while appropriate bus transactions are issued over the memory bus. This is a problem.
Moreover, the problem above is exacerbated in proportion to the size of the area of memory that is to be copied. To obtain exclusive ownership of, say, 100 cache lines will stall a program much longer than the time it takes to obtain exclusive ownership of a single cache line.
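The two-step copy and its scaling with block size can be modeled with a short sketch in C. This is an illustrative model only: the cache_line structure, the copy_lines function, and the 64-byte line size are assumptions made for the example, and the counter merely stands in for bus transactions that a real cache unit would issue.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LINE_BYTES 64 /* assumed cache line size */

/* Hypothetical model of one cached line. */
typedef struct {
    unsigned char bytes[LINE_BYTES];
    bool exclusive; /* true once exclusive ownership is held */
} cache_line;

/* Copies n lines from src to dest and returns the number of ownership
 * requests that would have stalled the program. */
size_t copy_lines(cache_line *dest, const cache_line *src, size_t n)
{
    size_t stalls = 0;
    for (size_t i = 0; i < n; i++) {
        if (!dest[i].exclusive) {
            stalls++; /* step 1: stand-in for the bus transaction */
            dest[i].exclusive = true;
        }
        /* step 2: modify DEST so that it matches SRC */
        memcpy(dest[i].bytes, src[i].bytes, LINE_BYTES);
    }
    return stalls;
}
```

Under this model, copying into 100 lines that are not yet exclusively owned incurs 100 stand-in bus requests, whereas repeating the copy once ownership is held incurs none.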
Therefore, what is needed is an apparatus and method that enable a programmer to direct a microprocessor to obtain exclusive ownership of a first block of cache lines and to copy data into the first block of cache lines from a second block of cache lines, where the data is copied in parallel with execution of subsequent programmed instructions.
In addition, what is needed is an apparatus and method that enable a programmer to direct a microprocessor to write back data to memory from the first block of cache lines.