1. Field of the Invention
The present invention relates in general to cache management in microprocessors and, more particularly, to a system, method, and mechanism for instruction cache block invalidation.
2. Relevant Background
Computer programs comprise a series of instructions that direct a data processing mechanism to perform specific operations on data. These operations include loading data from memory, storing data to memory, adding, multiplying, and the like. Data processors, including microprocessors, microcontrollers, and the like, include a central processing unit (CPU) comprising one or more functional units that perform various tasks. Typical functional units include a decoder, an instruction cache, a data cache, an integer execution unit, a floating point execution unit, a load/store unit, and the like. A given program may run on a variety of data processing hardware.
As used herein the term "data processor" includes complex instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids. A data processor may be a stand-alone central processing unit (CPU) or an embedded system comprising a processor core integrated with other components to form a special purpose data processing machine. The term "data" refers to digital or binary information that may represent memory addresses, data, instructions, or the like.
In response to the need for improved performance, several techniques have been used to extend the capabilities of early processors, including pipelining, superpipelining, and superscaling. Pipelined architectures attempt to keep all the functional units of a processor busy at all times by overlapping execution of several instructions. Pipelined designs increase the rate at which instructions can be executed by allowing a new instruction to begin execution before a previous instruction is finished executing. A simple pipeline may have only five stages whereas an extended pipeline may have ten or more stages. In this manner, the pipeline hides the latency associated with the execution of any particular instruction.
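The throughput benefit of overlapping instructions can be illustrated with a simple cycle count. The sketch below assumes an idealized pipeline in which every stage takes one cycle and no stalls occur; the function names are hypothetical and chosen only for this illustration.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages):
    """Idealized pipeline: one new instruction enters per cycle, no stalls.

    The first instruction needs num_stages cycles to drain; each later
    instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


# Assumed workload: 100 instructions on a five-stage pipeline.
serial = unpipelined_cycles(100, 5)
overlapped = pipelined_cycles(100, 5)
```

Under these assumptions the per-instruction latency is unchanged, but the cost of each additional instruction falls to a single cycle, which is the sense in which latency is hidden.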
The ability of processors to execute instructions has typically outpaced the ability of memory subsystems to supply instructions and data to the processors. Most processors use a cache memory system to speed memory access. Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed instructions and data, designed to speed up subsequent access to the same data. Cache may be implemented as a unified cache in which data and instructions are cached together, or as a split cache having separate instruction and data caches.
Cache technology is based on a premise that programs frequently reuse the same instructions and data. When data is read from main system memory, a copy is also saved in the cache memory. In the case of instructions, subsequent instruction requests are checked against the cache to see if the needed information has already been stored. If the instruction has been stored in the cache, it is delivered to the processor with low latency. If, on the other hand, the information has not previously been stored in the cache, it is fetched from main memory and also saved in the cache for future access.
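The hit and miss behavior described above can be sketched as follows. This is a minimal illustration, not a model of any particular hardware: the class `SimpleCache`, its fields, and the example addresses are all assumptions introduced here.

```python
class SimpleCache:
    """Toy cache that keeps a copy of every word it has fetched."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # backing store: address -> word
        self.lines = {}                 # cached copies: address -> word
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            # Hit: the word was stored earlier, so serve the cached copy.
            self.hits += 1
            return self.lines[address]
        # Miss: fetch from main memory and keep a copy for future accesses.
        self.misses += 1
        word = self.main_memory[address]
        self.lines[address] = word
        return word


memory = {0x100: "add", 0x104: "mul"}  # hypothetical instruction words
cache = SimpleCache(memory)
first = cache.read(0x100)   # not yet cached, so this access misses
second = cache.read(0x100)  # the copy saved by the first access is reused
```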
A feature of program instructions is that they often exhibit "spatial locality". Spatial locality is a property that information (i.e., instructions and data) that is required to execute a program is often close in address space in the memory media (e.g., random access memory (RAM), disk storage, and the like) to other data that will be needed in the near future. Instructions tend to have higher spatial locality than data. Cache designs take advantage of spatial locality by filling the cache not only with information that is specifically requested, but also with additional information from addresses sequentially adjacent to the currently fetched address. In this manner if the sequentially adjacent instructions are actually needed, they will already be loaded into cache.
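A block fill that exploits spatial locality might look like the sketch below. The block size, word size, and the `BlockCache` class are assumptions made for illustration; the text does not specify them.

```python
BLOCK_SIZE = 32  # bytes per cache block (assumed value)
WORD_SIZE = 4    # bytes per instruction word (assumed value)


class BlockCache:
    """Toy cache that fills a whole block of adjacent words on each miss."""

    def __init__(self, memory):
        self.memory = memory  # address -> instruction word
        self.blocks = {}      # block base address -> list of words
        self.misses = 0

    def fetch(self, address):
        base = address - (address % BLOCK_SIZE)
        if base not in self.blocks:
            # Miss: load the requested word and its sequential neighbors.
            self.misses += 1
            self.blocks[base] = [
                self.memory.get(base + offset)
                for offset in range(0, BLOCK_SIZE, WORD_SIZE)
            ]
        return self.blocks[base][(address - base) // WORD_SIZE]


# Sixteen sequential instruction words spanning two blocks.
memory = {addr: f"insn@{addr:#x}" for addr in range(0, 64, WORD_SIZE)}
cache = BlockCache(memory)
fetched = [cache.fetch(addr) for addr in range(0, 64, WORD_SIZE)]
```

With these assumed sizes, a sequential walk over sixteen words touches main memory only once per block rather than once per word.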
In a split cache or "Harvard architecture" cache it is necessary to maintain coherency between the instruction and data caches. In this type of architecture the instruction cache is usually optimized for read operations and has little support for write operations as most implementations do not allow writes to the instruction cache. As a result, the content of the instruction cache can get out of sync with the data cache and main memory when the program performs a store operation into the address space occupied by the program. This occurs in self-modifying code, for example.
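The coherency problem can be reproduced in a few lines. The sketch below assumes a write-through data cache and models each cache as a dictionary; the helper names and the address are hypothetical.

```python
memory = {0x200: "old_insn"}  # hypothetical instruction in program space
icache = {}                   # instruction cache: filled on fetch, never written by stores
dcache = {}                   # data cache: receives stores


def fetch_instruction(addr):
    """Instruction fetch path: reads through the instruction cache only."""
    if addr not in icache:
        icache[addr] = memory[addr]
    return icache[addr]


def store(addr, value):
    """Store path: updates the data cache and (write-through, assumed) memory."""
    dcache[addr] = value
    memory[addr] = value


first = fetch_instruction(0x200)   # fills the instruction cache
store(0x200, "new_insn")           # self-modifying store into program space
stale = fetch_instruction(0x200)   # instruction cache still holds the old word

icache.pop(0x200, None)            # discard the inconsistent entry
fresh = fetch_instruction(0x200)   # refetch now sees the stored word
```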
One solution to this problem is to define special instructions or special instruction sequences, or both, that maintain the instruction cache coherency. These instructions and instruction sequences function to discard or invalidate portions of the cache that are inconsistent and to explicitly synchronize the instruction cache with other instructions. Generally such instructions must be handled carefully by software. All instructions subsequent to an instruction cache block invalidate (ICBI) instruction must be assured that the preceding ICBI instruction has completed. In prior solutions the only way to assure completion was to serialize the ICBI execution (i.e., execute each ICBI by itself in a pipeline) so that the ICBI was committed to the instruction cache before a subsequent instruction was issued to the pipeline. As a result of serialization, each ICBI consumed multiple pipeline cycles before a subsequent instruction was issued. Such restrictions reduce instruction throughput and can significantly affect processor performance in cases where an instruction is changed by a previous instruction or new instructions are brought in from external sources. It is desirable to implement instruction cache invalidate instructions and cache synchronization instructions using existing hardware in an efficient manner that also avoids a need to serialize the instructions.
The present invention involves a processor having an execution pipeline. A cache memory includes a plurality of cache blocks with instruction words held in selected ones of the cache blocks. An ICBI address buffer is provided for holding addresses of instruction cache blocks to be invalidated by ICBI instructions pending in the processor's execution pipeline. An instruction cache controller coupled to the cache memory generates cache accesses to invalidate specified cache blocks in response to receiving buffered addresses from the ICBI address buffer. Preferably the cache accesses serve to commit ICBI instructions to the instruction cache asynchronously with respect to the processor's execution pipeline.
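The relationship between the ICBI address buffer and the instruction cache controller can be sketched in software as follows. This is an illustrative model only: the class names, the FIFO organization, the buffer capacity of four entries, and the one-commit-per-step policy are assumptions, not details given in the text.

```python
from collections import deque


class InstructionCache:
    """Tracks only the valid bit of each cache block."""

    def __init__(self):
        self.valid = {}  # block address -> True/False

    def fill(self, block_addr):
        self.valid[block_addr] = True

    def invalidate(self, block_addr):
        if block_addr in self.valid:
            self.valid[block_addr] = False

    def is_valid(self, block_addr):
        return self.valid.get(block_addr, False)


class IcbiAddressBuffer:
    """Holds block addresses of pending ICBI instructions (FIFO assumed)."""

    def __init__(self, capacity=4):  # capacity is an assumed value
        self.capacity = capacity
        self.pending = deque()

    def push(self, block_addr):
        if len(self.pending) >= self.capacity:
            return False  # buffer full: the pipeline would have to wait
        self.pending.append(block_addr)
        return True


class CacheController:
    """Drains the buffer independently of the execution pipeline."""

    def __init__(self, cache, buffer):
        self.cache = cache
        self.buffer = buffer

    def step(self):
        # Commit at most one buffered invalidation per call (assumed policy).
        if self.buffer.pending:
            self.cache.invalidate(self.buffer.pending.popleft())


icache = InstructionCache()
icache.fill(0x1000)
icache.fill(0x1020)
buf = IcbiAddressBuffer()
ctrl = CacheController(icache, buf)

buf.push(0x1000)                               # pipeline hands off the address and moves on
valid_before_commit = icache.is_valid(0x1000)  # block is still marked valid here
ctrl.step()                                    # controller commits on its own schedule
valid_after_commit = icache.is_valid(0x1000)
```

Separating `push` from `step` is what makes the commit asynchronous in this model: the pipeline only pays for the hand-off, while the invalidation itself happens later.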
In a particular example, the execution pipeline includes a fetch stage, a decode stage, one or more execution stages, and a writeback stage. The fetch unit is also coupled to receive interim results generated by the execution stages from a result bus. A decode unit obtains instructions fetched by the fetch unit and can detect an ICBI instruction. The decode unit notifies the fetch unit upon detection of an ICBI. At least one execution unit implements the decoded ICBI, determines an address identifying the cache block to be invalidated and places the address on the result bus. The ICBI address buffer is coupled to the result bus and stores the determined addresses for one or more pending ICBI instructions.
In another aspect the present invention involves a cache synchronization technique in which one or more instruction cache block addresses are buffered where each buffered address is associated with a pending ICBI request. A synchronization instruction (SYNCI) is executed following the pending ICBI instructions. In response to the SYNCI instruction the processor prevents instructions following the SYNCI from being executed until the pending ICBI instructions are committed to the instruction cache. In this manner, the instructions following the SYNCI are not exposed to the incomplete state created by the pending, uncommitted ICBI instructions. In response to the SYNCI instruction the processor determines when all pending ICBI instructions are committed then restarts execution of instructions following the SYNCI.
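The SYNCI behavior can be illustrated with a small cycle-by-cycle simulation. The sketch below is a simplified model under stated assumptions: one instruction issues per cycle, the controller needs a fixed `commit_latency` of cycles per invalidation, and the program encoding as `(opcode, address)` tuples is hypothetical.

```python
from collections import deque


def run(program, commit_latency=3):
    """Simulate issue timing; return (issue_trace, invalidated_addresses)."""
    pending = deque()  # buffered ICBI block addresses
    timer = 0          # cycles left until the oldest pending ICBI commits
    trace = []         # (cycle, opcode) for every issued instruction
    invalidated = []   # block addresses committed to the instruction cache
    cycle = 0
    pc = 0
    while pc < len(program) or pending:
        # Controller side: commits run independently of the pipeline.
        if pending:
            timer -= 1
            if timer == 0:
                invalidated.append(pending.popleft())
                if pending:
                    timer = commit_latency
        # Pipeline side: issue at most one instruction per cycle.
        if pc < len(program):
            opcode, addr = program[pc]
            if opcode == "SYNCI" and pending:
                pass  # stall until every pending ICBI has been committed
            else:
                if opcode == "ICBI":
                    if not pending:
                        timer = commit_latency
                    pending.append(addr)  # ICBI completes once buffered
                trace.append((cycle, opcode))
                pc += 1
        cycle += 1
    return trace, invalidated


program = [("ICBI", 0x40), ("ICBI", 0x80), ("SYNCI", None), ("ADD", None)]
trace, invalidated = run(program)
```

In this model the two ICBI instructions issue back to back without serialization, and only the SYNCI waits for the buffer to drain before the following instruction proceeds.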
In another aspect the present invention involves a method for operating an instruction cache. A plurality of instruction words are loaded into specified blocks in a cache, each block identified by an address and each block being identified as valid or invalid. An instruction cache block invalidate (ICBI) instruction is executed to mark a specified one of the cache blocks as invalid. While the execution is pending, the target address of the ICBI is buffered. The ICBI is considered complete when the target address is buffered. The target address is invalidated in the instruction cache asynchronously with respect to the execution pipeline using the buffered target address.
The foregoing and other features, utilities and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention as illustrated in the accompanying drawings.