The present invention relates to data processing system design and in particular to accelerating cache memory performance.
A number of microprocessor design features are now considered essential in order to obtain high performance at low cost. For example, processor implementations now typically allow issuance of multiple instructions during each clock cycle. Most such processors also employ cache memories to provide a latency and bandwidth advantage for reasonably large data blocks such as on the order of one megabyte or more. The cache memories permit high speed pipelined execution to occur while minimizing delays associated with reading and writing data.
Cache memories operate by mirroring the contents of main memory in a way which is transparent to a Central Processing Unit (CPU). For example, each memory address referenced by an instruction is first passed to a cache controller. The cache controller keeps track of which portions of main memory are currently assigned to the cache. If the cache is currently assigned to hold the contents of the requested address, a "cache hit" occurs and the cache is enabled to complete the memory reference access, whether it be a write or a read access. If this is not the case, a "cache miss" has occurred, and the main memory is enabled for access. When a miss occurs, the cache controller typically assigns the miss address to the cache, fetches or "fills" the data contained at that address from main memory and stores it in the cache, and if necessary, displaces the contents of a corresponding cache location.
Cache memories are implemented in a hierarchy with a primary data store or main memory being the lowest order of the hierarchy, a secondary cache or backup cache ("bcache") being a middle level of the hierarchy, and a primary level cache or "dcache" being the highest level cache. The bcache, for example, may be a "board level" cache implemented with memory chips external to the processor chip and the dcache may be implemented with on-chip memory devices.
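The hit, miss, and fill behavior described above can be sketched in a few lines. This is only an illustrative software model, not the hardware of the invention: the class name `DirectMappedCache`, the direct-mapped organization, and the use of one value per line are all assumptions made for the example.

```python
class DirectMappedCache:
    """Toy direct-mapped cache in front of a dict-backed main memory.

    Hypothetical model for illustration only; each line holds one value.
    """

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory      # address -> value (stands in for main memory)
        self.lines = {}           # line index -> (address, value)
        self.hits = 0
        self.misses = 0

    def read(self, address):
        index = address % self.num_lines
        entry = self.lines.get(index)
        if entry is not None and entry[0] == address:
            # Cache hit: the cache completes the reference.
            self.hits += 1
            return entry[1]
        # Cache miss: main memory is accessed and the line is filled,
        # displacing whatever previously occupied this index.
        self.misses += 1
        value = self.memory[address]
        self.lines[index] = (address, value)
        return value
```

Two addresses that map to the same line index displace one another, which is why a read of the first address after a read of the second misses again.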
It is desirable for the physical existence of the various cache hierarchy levels to be transparent. For example, the programmer should only have to worry about implementing instructions and not be concerned with the details of whether a particular target address is located in the dcache, bcache, or main memory. Furthermore, the programmer should be permitted to assume that the data written back to memory by a store instruction (STx) will always be written back properly. This important property of cache hierarchies is known as cache coherency.
In general, cache memories consist of a tag portion in addition to a data storage portion. The tag portion contains address and status bits for the data contained in the storage portion. The data portion typically contains multiple data bytes for each addressable cache location.
To complete an instruction reference to a cache memory, the data and tag memories are first read. If the referenced address matches the address in the tag portion, a hit occurs and the data associated with the tag is delivered to the consuming instruction. If the tags do not match, the data referenced by the consuming instruction must then be fetched and written into the cache. Cache victimization is an operation by which the contents of a cache location are copied back to main memory, and must typically be performed prior to displacing a "victim" cache location with the new data in order to avoid losing the contents of the victim location. It is therefore common to include a so-called "dirty" bit with each cache location, indicating whether the data for the cache location is different from the corresponding data in the next lower level of the hierarchy.
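The tag compare, dirty bit, and victim write-back described above can be illustrated with a small write-back cache model. It is a sketch under stated assumptions: `BLOCK_BYTES`, `NUM_LINES`, the `Line` record, and the `WriteBackCache` class are hypothetical, and a single integer stands in for the multiple data bytes of a block.

```python
from dataclasses import dataclass

BLOCK_BYTES = 64   # assumed block size, not taken from the text
NUM_LINES = 256    # assumed number of cache locations


@dataclass
class Line:
    valid: bool = False
    dirty: bool = False   # True when the line differs from the next lower level
    tag: int = 0
    data: int = 0         # stand-in for the block's data bytes


def split(address):
    """Return (tag, index) for a byte address."""
    block = address // BLOCK_BYTES
    return block // NUM_LINES, block % NUM_LINES


class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory        # block number -> data (next lower level)
        self.lines = [Line() for _ in range(NUM_LINES)]
        self.writebacks = []        # block numbers written back as victims

    def _lookup(self, address):
        tag, index = split(address)
        line = self.lines[index]
        if not (line.valid and line.tag == tag):
            # Miss: write back a dirty victim before displacing it.
            if line.valid and line.dirty:
                victim_block = line.tag * NUM_LINES + index
                self.memory[victim_block] = line.data
                self.writebacks.append(victim_block)
            # Fill the line with the referenced block.
            block = tag * NUM_LINES + index
            line.valid, line.dirty, line.tag = True, False, tag
            line.data = self.memory.get(block, 0)
        return line

    def load(self, address):
        return self._lookup(address).data

    def store(self, address, value):
        line = self._lookup(address)
        line.data = value
        line.dirty = True
```

In this model a store only marks the line dirty; main memory is updated later, when a conflicting reference displaces the line. A clean victim is simply overwritten.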
After the victim block has been written, but before the memory fill for the new block may proceed, the tag array still contains the victim address. During this period of time, if the same location is again accessed by another outside agent such as a second processor, the cache might provide a false hit response. One way of dealing with this problem is to allow this false hit response to occur, but then depend upon the fact that the data in the cache is the same as the data in the main memory until the memory fill updates the processor. This assumption is valid when the processor and the cache use a shared data bus. For example, in most known computing system architectures, the caches and main memory typically share a data bus. Therefore, the necessary fill operations may be completed for each level of the cache simultaneously.
A complication for cache management occurs if the system architecture permits sharing of write access to main memory locations among processors. Probe commands are therefore typically used in such architectures to allow one processor to inform another processor that it is attempting to write to a particular location. This allows the first processor to properly execute the store conditional instructions. However, the need to support such probe commands requires each processor to be able to determine whether it presently has the only valid contents of a main memory location in one or more of its own caches.
In the present invention, the primary level cache (dcache) and second level cache (bcache) do not share a common bus for access to the main memory. Rather, the dcache is provided with two different data buses to separately access the main memory and the bcache in order to provide higher bandwidth to each of these structures. In this case, a memory fill operation from the main memory may be consumed directly by the dcache without first having to wait for a fill operation in the bcache to complete.
Since the bcache bus is normally a high speed pipelined read bus, this avoids necessarily turning around the bus in order to update the bcache during the pendency of some other critical operation. Otherwise, one would have to wait for the bcache pipeline to drain, initiate the store operation, wait for the pipeline to drain again, and then turn the pipeline back around for subsequent read operations.
While this architecture improves processor performance by allowing the higher speed dcache memory to complete a fill operation without waiting for the slower bcache memory, there is a problem in that strict cache hierarchy rules are violated. In effect, the rules of cache hierarchy coherency are temporarily "bypassed" in the sense that the bcache is not immediately updated with the fill data. Thus, the bcache may not be updated for a long period of time and there is no guarantee that "stale" data in the bcache is the same as that in main memory.
A simple solution to this problem might be either to invalidate the bcache tag on all bcache victim operations, or otherwise to ensure that all fill operations are cycled through the bcache, i.e., disable the bypassing mode. However, these approaches consume either precious bcache tag bandwidth or dcache memory bandwidth.
Thus, while a processor according to the present invention uses two independent memory access ports for the bcache memory and the main memory, a set of rules is also observed by the processor to enable it to infer the bcache state without unnecessarily performing bcache reads.
In accordance with the invention, upon the issuance of a memory reference instruction such as a load or store instruction, the dcache memory array is first checked to see if it has the contents of the referenced location as is typical. If there is a hit in the dcache, then the memory access is complete.
However, if a dcache miss occurs, then a bcache read is initiated. In the process of reading data from the bcache, if it becomes apparent that a dcache victim operation will be required, i.e., that the dcache is already full and dcache locations will need to be displaced in order to copy new information from the bcache to the dcache, a determination is first made as to whether or not the dcache victim block is dirty. If the dcache victim block is dirty, this block must be scheduled for eviction either to the bcache or to main memory. If an index portion of the memory reference location is not equal to the index portion of the dcache victim block, then the dcache victim block should be evicted to the bcache.
If the index portions do match (this is called a "subset match") and the old dcache block was dirty, then the block should be scheduled for eviction to main memory. In particular, it can be inferred in this instance that the corresponding bcache block is stale, having been bypassed on a previous fill operation. In other words, if the bcache index of the victim block is the same as that of the referenced address, the two data blocks are attempting to reside in the same location in the bcache, and the processor infers that the copy in the bcache is stale. Thus, the victim dcache data should be written back directly to main memory, bypassing the bcache.
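The routing decision for a dcache victim can be sketched as a small function. This is a minimal illustration of the rule as stated above, assuming a direct-mapped bcache; the constants `BLOCK_BYTES` and `BCACHE_LINES` and the names `bcache_index`, `VictimTarget`, and `route_dcache_victim` are hypothetical and chosen only for the example.

```python
from enum import Enum

BLOCK_BYTES = 64        # assumed block size
BCACHE_LINES = 16384    # assumed number of bcache locations


class VictimTarget(Enum):
    NONE = "no write-back needed"
    BCACHE = "evict to bcache"
    MAIN_MEMORY = "evict to main memory"


def bcache_index(address):
    """Index portion of an address, i.e. the bcache location it maps to."""
    return (address // BLOCK_BYTES) % BCACHE_LINES


def route_dcache_victim(reference_address, victim_address, victim_dirty):
    """Decide where a displaced dcache block should be written."""
    if not victim_dirty:
        # A clean victim matches the lower level and can simply be dropped.
        return VictimTarget.NONE
    if bcache_index(reference_address) == bcache_index(victim_address):
        # Subset match: the bcache copy is inferred to be stale, so the
        # victim bypasses the bcache and goes straight to main memory.
        return VictimTarget.MAIN_MEMORY
    return VictimTarget.BCACHE
```

With these assumed sizes, two addresses that differ by exactly `BLOCK_BYTES * BCACHE_LINES` bytes share a bcache index and therefore produce a subset match.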
Continuing with the bcache read operation, a tag lookup for the referenced memory location is performed in the bcache tag array and a bcache memory fill to the dcache is allowed to proceed if the tags match. However, the dcache victim block is evicted directly to main memory in this instance as the processor has time to complete the eviction process, such as through a victim buffer. If, in this instance, the bcache tag is dirty and the dcache has not already been evicted to main memory in the prior steps, an inference can be made that the bcache contents are not stale. The bcache victim block must therefore also be moved back to main memory.
If the lookup in the bcache tag array did not produce a match, then it will be necessary to fetch the data from main memory. In this instance, the referenced address is placed in a miss address file (MAF) and the fill from the main memory to the dcache proceeds directly. During this process, if the victim dcache block was dirty, then it needs to be evicted back to main memory by placing it in the victim buffer and extracting it as the processor has time. Once the victim block has been removed to main memory, then the referenced address is removed from the miss address file.
The miss address file provides additional assurance that stale bcache data will not be used. In particular, upon a subsequent subset match between a referenced location and an address in the miss address file, the memory reference is not allowed to proceed until the miss address file is cleared. In the event of an external probe operation, a memory lock response will be provided until the miss address file is cleared.
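The blocking role of the miss address file can be modeled as follows. This is a sketch only: the class `MissAddressFile`, its method names, the response strings `"LOCKED"` and `"PROCEED"`, and the sizing constants are assumptions for illustration and are not taken from the text.

```python
BLOCK_BYTES = 64        # assumed block size
BCACHE_LINES = 16384    # assumed number of bcache locations


def bcache_index(address):
    """Index portion of an address, i.e. the bcache location it maps to."""
    return (address // BLOCK_BYTES) % BCACHE_LINES


class MissAddressFile:
    """Tracks outstanding misses, keyed by bcache index."""

    def __init__(self):
        self._pending = {}   # bcache index -> address of the outstanding miss

    def allocate(self, address):
        index = bcache_index(address)
        if index in self._pending:
            raise ValueError("subset match with an outstanding miss")
        self._pending[index] = address

    def retire(self, address):
        self._pending.pop(bcache_index(address), None)

    def must_stall(self, address):
        """True when a reference subset matches an outstanding miss."""
        return bcache_index(address) in self._pending

    def probe_response(self, address):
        """Response given to an external probe for this address."""
        return "LOCKED" if self.must_stall(address) else "PROCEED"
```

A reference or probe that subset matches an outstanding entry is held off until `retire` removes that entry, so the stale bcache location is never consulted in the meantime.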
It can now be understood how the present invention allows for an architecture which splits the memory buses and maintains cache hierarchy consistency without performing an explicit invalidation of the bcache tag. Two explicit rules are used to determine the status of a block read from the bcache. First, if any memory reference subset matches a block in the dcache, the associated bcache block is ignored. Second, if any memory reference subset matches a block in the miss address file, the associated bcache block is ignored. Therefore, any further load or store references which subset match the first reference are not allowed to proceed until the fill back to main memory has been completed and the associated miss address file entry has been retired. This ensures that neither an agent in the host processor nor an external agent can illegally use the stale bcache data.
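The two rules can be combined into a single predicate. The following is a minimal sketch, assuming a direct-mapped bcache and representing the dcache and the miss address file simply as collections of block addresses; the function name `bcache_block_is_ignored` and the constants are hypothetical.

```python
BLOCK_BYTES = 64        # assumed block size
BCACHE_LINES = 16384    # assumed number of bcache locations


def bcache_index(address):
    """Index portion of an address, i.e. the bcache location it maps to."""
    return (address // BLOCK_BYTES) % BCACHE_LINES


def bcache_block_is_ignored(address, dcache_blocks, maf_blocks):
    """Apply both rules to decide whether the bcache block is disregarded.

    dcache_blocks: addresses of blocks currently held in the dcache.
    maf_blocks:    addresses currently held in the miss address file.
    """
    index = bcache_index(address)
    # Rule 1: subset match with a block in the dcache.
    in_dcache = any(bcache_index(a) == index for a in dcache_blocks)
    # Rule 2: subset match with a block in the miss address file.
    in_maf = any(bcache_index(a) == index for a in maf_blocks)
    return in_dcache or in_maf
```

Neither rule requires reading the bcache tag array, which is the point of inferring the bcache state rather than looking it up.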
An additional complication comes from the fact that a second processor in a multiprocessor arrangement may issue probe commands. In response to such a probe command, the first processor must check to see if it is in the process of accessing the data. Normally, this access operation is executed by looking in the contents of the bcache. However, in an instance where the memory data buses are split, the processor must not only consume cycles to check the bcache, but also consume additional cycles in order to determine if an address is locked in the dcache. Therefore, what is needed is a technique for allowing the processor to infer the bcache state not only for its internal operations, but also for optimized response to external probe commands.
The present invention provides an elegant solution in this instance as well. In particular, memory references generated by probe commands follow the same process flow except that they do not generate victim transactions (i.e., probe commands are issued simply to determine whether or not a location has been locked and do not attempt to write the location).