Providing ever faster microprocessors is one of the major goals of current processor design. Many different techniques have been employed to improve processor performance. One technique which greatly improves processor performance is the use of cache memory. As used herein, cache memory refers to a set of memory locations which are formed on the microprocessor itself and which consequently have a much faster access time than other types of memory, such as RAM or magnetic disk, which are located separately from the microprocessor chip. By storing a copy of frequently used data in the cache, the processor is able to access the cache when it needs this data, rather than having to go "off chip" to obtain the information, greatly enhancing the processor's performance.
However, certain problems are associated with cache memory. In particular, a significant problem exists when multiple processors are employed in a system and need the same data. In this case, the system needs to ensure that the data being requested is coherent, that is, valid for the processor at that time. Another problem exists when the data is stored in the cache of one processor, and another processor is requesting the same information.
Superscalar processors achieve performance advantages over conventional scalar processors because they allow instructions to execute out of program order. In this way, one slow executing instruction will not hold up subsequent instructions which could execute using other resources on the processor while the stalled instruction is pending.
In a typical architecture, when an instruction requires a piece of data, the processor goes first to the onboard cache to see if the data is present in the onboard cache. Some caches have two external ports, and the cache can be interleaved. This means that, for example in FIG. 1, a cache 100 has two cache banks, 140 and 130. One cache bank could be for odd addresses and the other cache bank would then be for even addresses.
Internally, each of cache banks 140 and 130 has an internal input port (not shown) to which the address information of a cache request is presented. In FIG. 1, the data for address A1 is stored on cache line 110 in cache bank 130, and the data for address A2 is stored on cache line 120 in cache bank 140. Cache 100 has two external ports for input data, port 180 and port 190.
Cache request 1 shows a cache request for instruction 1 (not shown), and cache request 2 shows a cache request for instruction 2 (not shown). Instruction 1 is an older instruction than instruction 2, meaning it should be executed before instruction 2. If a superscalar processor has multiple load units, such as in the PowerPC™ processor from IBM Corporation, Austin, Tex., then both instructions could make a cache request at the same time. In the example shown, both instruction 2 and instruction 1 are attempting to access data at address A1, and have submitted cache requests to cache 100 to do so.
Since bank 130 only has one internal input port, both cache requests cannot be processed at the same time. This is due to the interleaved nature of cache 100.
FIG. 2 shows what happens when cache request 2 accesses cache bank 130 before cache request 1. Cache request 2 hits in cache bank 130 for the data it needs. However, cache request 1 cannot access cache bank 130 until at least the next cycle. Thus, newer instruction 2 can get the data it needs before older instruction 1 can. Newer instruction 2 can complete before older instruction 1 in this case because of this port allocation conflict.
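The port conflict described above can be illustrated with a minimal sketch. It assumes a two-bank cache interleaved on the low-order address bit and an arbiter that serves one request per bank per cycle. The names `bank_of` and `schedule`, the address value, and the arbitration order are hypothetical and are not taken from FIG. 1 or FIG. 2.

```python
from collections import defaultdict


def bank_of(address, num_banks=2):
    """Select a bank from the low-order address bits (even/odd for two banks)."""
    return address % num_banks


def schedule(requests, num_banks=2):
    """Return the cycle in which each request is served.

    `requests` is a list of (name, address) pairs in the order a hypothetical
    arbiter picks them, which need not be program order. Each bank has a
    single internal input port, so it accepts one request per cycle.
    """
    next_free = defaultdict(int)  # bank -> first cycle in which its port is free
    served = {}
    for name, address in requests:
        bank = bank_of(address, num_banks)
        served[name] = next_free[bank]
        next_free[bank] += 1
    return served


A1 = 0x40  # hypothetical address; both requests target it

# The arbiter happens to pick the newer request first.
same_bank = schedule([("request 2", A1), ("request 1", A1)])

# Requests to different banks do not conflict.
different_banks = schedule([("request 2", A1), ("request 1", A1 + 1)])
```

In this sketch `same_bank` places request 2 in cycle 0 and request 1 in cycle 1, so the newer instruction receives its data first, while `different_banks` serves both requests in cycle 0.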
The same ordering problem can occur when an older instruction misses in the cache, and a newer instruction hits. A miss occurs when the address of the data cannot be found in the memory management unit, and the memory management unit must then request that the data be brought from higher memory. A hit occurs when both the address of the data and the data are accessible through the memory management unit and the cache, and this data can be output to an instruction waiting for it.
A cache miss by an older instruction followed by a cache hit by a newer instruction, both attempting to access the same data, can occur when the real address of the data is represented by two different effective addresses. When the effective address requested by the newer instruction and its data are already accessible by the memory management unit and the cache, and the effective address and data of the older instruction are not accessible in the memory management unit and the cache, this also leads to a situation where a newer instruction accessing the same data as an older instruction can complete before the older instruction.
In multi-processor systems, a cache miss in one processor may trigger a "snoop" request to the other processors in the system. This snoop request indicates to the other processors that the data being "snooped" is being requested by another processor, and the other processors should determine whether the address being sought resides in their own caches. If it does, the main memory data should be made coherent, that is, updated to reflect the correct current state of the system.
In a superscalar architecture, this problem is compounded by the fact that loads may finish out of order; in other words, a newer instruction may be marked for completion before an older one. Thus, two load instructions may address the same cache location, and the newer instruction may actually be furnished with its data before the older instruction. The newer instruction may then be marked for completion out of order, possibly causing false data to be used in the completion of an instruction. When a later load instruction bypasses an earlier load instruction, the earlier load instruction may get newer data than it should have received based on the original program order.
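The aliasing case can be sketched as follows. This is a simplified model under stated assumptions: the memory management unit is represented by a dictionary of the effective-to-real translations it currently holds, and the cache is keyed by real address. The function `lookup` and all address values are hypothetical.

```python
def lookup(effective_address, translations, cache):
    """Return ("hit", data) or ("miss", None) for one cache request.

    A request hits only if the effective address is currently translatable
    and the data for the resulting real address is in the cache.
    """
    real = translations.get(effective_address)
    if real is None or real not in cache:
        return ("miss", None)
    return ("hit", cache[real])


REAL = 0x1000     # one real address (hypothetical)
EA_OLD = 0xA000   # effective address used by the older instruction
EA_NEW = 0xB000   # effective address used by the newer instruction

# Both effective addresses refer to REAL, but only the newer instruction's
# translation is currently held by the modeled memory management unit.
translations = {EA_NEW: REAL}
cache = {REAL: "data"}

older = lookup(EA_OLD, translations, cache)
newer = lookup(EA_NEW, translations, cache)
```

Here `older` is a miss and `newer` is a hit even though both requests refer to the same real address, so the newer instruction can obtain the data first.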
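A minimal sketch of the snoop behavior described above follows. It assumes each processor's cache tracks a dirty flag per line and that a snoop hit on a dirty line writes the value back to main memory; actual coherency protocols differ in detail. The class `Processor` and the function `load` are hypothetical names introduced only for illustration.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # address -> (value, dirty flag)

    def snoop(self, address, memory):
        """Handle a snoop request; return True if the address is cached here."""
        line = self.cache.get(address)
        if line is None:
            return False
        value, dirty = line
        if dirty:
            # Make main memory coherent with the locally modified copy.
            memory[address] = value
            self.cache[address] = (value, False)
        return True


def load(requester, address, others, memory):
    """Load a value, snooping the other processors on a local cache miss."""
    if address in requester.cache:
        return requester.cache[address][0]
    for processor in others:
        processor.snoop(address, memory)
    value = memory[address]
    requester.cache[address] = (value, False)
    return value


memory = {0x10: 1}
p0, p1 = Processor("P0"), Processor("P1")
p0.cache[0x10] = (2, True)  # P0 holds a modified copy of address 0x10

value = load(p1, 0x10, [p0], memory)
```

In this sketch the miss in P1 causes P0 to write its modified value back, so P1 loads the value 2 and main memory is updated to match.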
Previous solutions to this coherency problem include the one detailed in U.S. patent application Ser. No. 08/591,249, filed Jan. 18, 1996, now U.S. Pat. No. 5,737,636, entitled "A Method and System for Bypassing in a Load/Store Unit of a Superscalar Processor." In this solution, a Load Queue held a page index and a real address along with an ID and a valid bit. The ID indicated the program order of the load instruction.
In addition to the aforementioned entries, the Load Queue entry also held a modified field which indicated whether the cache line entry for the address had been modified. When a cache access, such as a store instruction or a snoop request, indicated that the cache line may have been modified, the Load Queue was searched. If it contained an entry for the same line, the modified bit was set to indicate a possible modification.
Any subsequent load would perform a comparison against the Load Queue entries. If the same line was pending in the Load Queue and marked as modified, the ID field was checked. If the current load was older than the load that was pending and modified, the pending loads in the Load Queue were canceled and re-executed after the subsequent load. This avoids the problem of having the older load get newer data than the newer load.
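The Load Queue scheme described above can be sketched roughly as follows. This is an illustrative model, not the implementation of the referenced patent: the page index and real address are collapsed into a single `line` field, a smaller `load_id` is assumed to mean an older load, and only modified entries on the same line that are newer than the incoming load are canceled. All class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadQueueEntry:
    line: int               # stands in for the page index and real address
    load_id: int            # program order; smaller means older (assumption)
    valid: bool = True
    modified: bool = False


class LoadQueue:
    def __init__(self):
        self.entries = []

    def record(self, line, load_id):
        """Record a load that has obtained its data but is still pending."""
        self.entries.append(LoadQueueEntry(line, load_id))

    def mark_modified(self, line):
        """Called when a store or snoop may have modified the cache line."""
        for entry in self.entries:
            if entry.valid and entry.line == line:
                entry.modified = True

    def check_load(self, line, load_id):
        """Return the IDs of pending loads to cancel and re-execute."""
        cancelled = []
        for entry in self.entries:
            if (entry.valid and entry.line == line
                    and entry.modified and load_id < entry.load_id):
                entry.valid = False
                cancelled.append(entry.load_id)
        return cancelled


queue = LoadQueue()
queue.record(line=7, load_id=2)          # newer load executed first
untouched = queue.check_load(line=7, load_id=1)

queue.mark_modified(line=7)              # a store or snoop touches line 7
cancelled = queue.check_load(line=7, load_id=1)
```

In this sketch the first check cancels nothing because the line has not been modified, while the second check returns the ID of the newer load so that it can be re-executed after the older one.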