1. Field of the Invention
The present invention relates in general to microprocessors and, more particularly, to a system, method, and microprocessor architecture providing in-line bank conflict detection and resolution in a multi-ported non-blocking cache.
2. Relevant Background
Modern processors, also called microprocessors, use techniques including pipelining, superpipelining, superscaling, speculative instruction execution, and out-of-order instruction execution to enable multiple instructions to be issued and executed each clock cycle. As used herein the term processor includes complete instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids. The ability of processors to execute instructions has typically outpaced the ability of memory subsystems to supply instructions and data to the processors, however. Most processors use a cache memory system to speed memory access.
Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed data, designed to speed up subsequent access to the same data. Cache technology is based on a premise that programs frequently re-execute the same instructions. When data is read from main system memory, a copy is also saved in the cache memory, along with an index to the associated main memory. The cache then monitors subsequent requests for data to see if the information needed has already been stored in the cache. If the data had indeed been stored in the cache (i.e., a "hit"), the data is delivered immediately to the processor while the attempt to fetch the information from main memory is aborted (or not started). If, on the other hand, the data had not been previously stored in cache (i.e., a "miss") then it is fetched directly from main memory and also saved in cache for future access.
Typically, processors support multiple cache levels, most often two or three levels of cache. A level 1 cache (L1 cache) is usually an internal cache built onto the same monolithic integrated circuit (IC) as the processor itself. On-chip cache is the fastest (i.e., lowest latency) because it is accessed by the internal components of the processor. On the other hand, off-chip cache is an external cache of static random access memory (SRAM) chips plugged into a motherboard. Off-chip cache has much higher latency, although is typically much shorter latency than accesses to main memory.
Given the size disparity between main system memory (which may be tens or hundreds of megabytes) and cache memory (which is typically less than one megabyte), certain rules are used to determine how to copy data from main memory to cache as well as how to make room for new data when a cache is full. In direct mapped cache, the cache location for a given memory address is determined from the middle address bits. In other words, each main memory address maps to a unique location in the cache. Hence, a number of different memory addresses will map to the same cache location. In a fully associative cache, data from any main memory address can be stored in any cache location. Each cache line is indexed by a "tag store" that holds a "tag" generated, for example, by hashing the memory address that it indexes. All tags are be compared simultaneously (i.e., associatively) with a requested address, and if one tag matches, then its associated data is accessed. This requires an associative memory to hold the tags that makes this form of cache expensive.
Set associative cache is essentially a compromise between direct mapped cache and a fully associative cache. In a set associative cache, each memory address is mapped to a certain set of cache locations. An N-way set associative cache allows each address to map to N cache locations (for example, four-way set associative allows each address to map four two different cache locations). In other words, in a four-way set associative cache, each tag maps to four possible cache locations in a set. Upper address bits in the requested address will uniquely identify which item in the set the tag is referencing.
Superscalar processors achieve higher performance by executing many instructions simultaneously. These instructions generate multiple numbers of memory loads or stores per cycle. Conventional processors use several techniques to allow coherent and parallel access to the cache and memory hierarchy. One technique, used commonly at the lowest level of cache access, provides duplicate copies of the cache. This technique is most effective with read only cache so as to avoid a problem with maintaining consistency between both cache copies. Each cache copy doubles the real estate consumed as compared to a single cache copy. Increased size tends to limit clock speeds. Hence, this technique is limited to small caches and typically enables only two cache copies and so two accesses per cycle.
Another technique involves using high speed circuitry to allow two or more accesses per processor clock cycle. This approach assumes that the processor clock is sufficiently slow that the cache clock can be increased. In practice, however, the processor performance demands force the processor clock to be increased such that the ratio of processor clock to cache clock fails to allow significant advantage to this technique.
A similar technique is to provide multiple banks with each bank serving a particular main memory address. While this technique is adaptable to larger cache sizes, it too has limited scaleability. Multi-bank caches, like duplicate caches, tend to limit clock speeds. Multiple banks are successfully used to enable multiple accesses per clock cycle, but have performance limits caused by addresses conflicts. Address conflicts arise when two cache accesses are attempting to access the same bank. What is needed is an effective means to limit the opportunity for address conflicts. Further, a need exists for efficient address conflict detection as well as a method and apparatus for resolution of address conflicts once detected.