Providing ever faster microprocessors is one of the major goals of current processor design. Many different techniques have been employed to improve processor performance. One technique which greatly improves processor performance is the use of cache memory. As used herein, cache memory refers to a set of memory locations which are formed on the microprocessor itself and which consequently have a much faster access time than other types of memory, such as RAM or magnetic disk, which are located separately from the microprocessor chip. By storing a copy of frequently used data in the cache, the processor is able to access the cache when it needs this data, rather than having to go "off chip" to obtain the information, greatly enhancing the processor's performance.
However, certain problems are associated with cache memory. One problem occurs when the data in the cache memory becomes misaligned with respect to the cache boundaries. Although many of the newer software compilers endeavor to avoid the problem of misalignment, certain types of operations, such as the familiar COMMON statement in the FORTRAN programming language, frequently cause cache misalignment. In order to maintain complete software capability, a processor must therefore have the ability to handle misaligned cache data. The problem of misaligned data in cache memory is described in greater detail with respect to FIGS. 1A and 1B.
FIG. 1A is a diagram depicting the contents of a conventional cache memory, such as the cache memory used in the POWER PC family of processors available from IBM Corporation. As shown, the cache 100 contains a number of "cache lines", each cache line being 128 bytes wide. However, a maximum of 8 bytes may be read from the cache during any single access. As used herein, the term "word" shall refer to a four byte block of data and the term "double word" shall refer to an eight byte block of data. FIG. 1A shows a double word within cache line 0. The first word is xxab and the second word is cdxx, where a, b, c and d are desired bytes of data and "x" represents unneeded bytes of data. Conventionally, processors are designed to allow an n-bit wide transfer between the processor's execution units and the cache memory. For purposes of illustration, it will be assumed that the processor which accesses the cache shown in FIG. 1A allows a 32 bit, or one word, wide data transfer. Any word within any cache line of cache 100 may be retrieved by a single load instruction. Similarly, any word in any cache line may be written by operation of a single store instruction. If the processor requires the word containing the bytes a, b, c and d, it should be clear from the above that only a single load instruction is required to obtain all four bytes of data from the cache, since all of the required data resides in a single double word of the cache line.
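The cache geometry described above can be illustrated with a minimal Python sketch. It assumes that each cache access returns one aligned eight byte double word, which the text implies but does not state outright. The function names and the byte addresses are hypothetical and chosen only for illustration.

```python
LINE_SIZE = 128    # bytes per cache line, as in cache 100
DOUBLE_WORD = 8    # bytes; the most the cache returns in a single access


def locate(addr):
    """Return (cache line, double word within the line, byte within the double word)."""
    line, offset = divmod(addr, LINE_SIZE)
    return line, offset // DOUBLE_WORD, offset % DOUBLE_WORD


def single_access_suffices(addr, size):
    """True if all `size` bytes starting at `addr` lie in one aligned double word.

    Assumption: accesses are made on aligned double word boundaries.
    """
    return addr // DOUBLE_WORD == (addr + size - 1) // DOUBLE_WORD


# Hypothetical placement of FIG. 1A: the double word "xxabcdxx" starts at
# byte 0 of cache line 0, so the bytes a, b, c and d begin at byte address 2.
position = locate(2)
one_access = single_access_suffices(2, 4)
```

Under these assumptions the four desired bytes fall in double word 0 of cache line 0, so `one_access` is true and a single load retrieves them.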
Referring now to FIG. 1B, the same data is shown stored in the cache 100. However, this time it is misaligned in relation to the cache boundary. Specifically, it is seen that bytes a, b and c of the requested word are stored in cache line 0, but byte d is stored in cache line 1. Now the processor must make two accesses to the cache in order to obtain all four bytes of data. Moreover, since the data is coming back from the cache in two separate accesses, it must be reassembled before it is written into one of the processor's architected registers.
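The need for two accesses in the misaligned case can be sketched in Python as follows. This is an illustrative model only: it considers just the 128 byte cache line boundary, and the function name and the byte addresses used are assumptions, not taken from the figures.

```python
LINE_SIZE = 128  # bytes per cache line, as in cache 100


def split_access(addr, size, line_size=LINE_SIZE):
    """Split a load of `size` bytes at byte address `addr` into per-line pieces.

    Returns a list of (line index, offset in line, length) tuples, one tuple
    for each cache access that is needed.
    """
    pieces = []
    remaining = size
    while remaining > 0:
        line, offset = divmod(addr, line_size)
        # Take as many bytes as are left in the current cache line.
        length = min(remaining, line_size - offset)
        pieces.append((line, offset, length))
        addr += length
        remaining -= length
    return pieces


# Hypothetical aligned case (FIG. 1A): the word starts at byte 2 of line 0.
aligned = split_access(2, 4)

# Hypothetical misaligned case (FIG. 1B): bytes a, b and c occupy the last
# three bytes of line 0 and byte d is the first byte of line 1.
misaligned = split_access(125, 4)
```

Here `aligned` contains a single piece, while `misaligned` contains two pieces of three bytes and one byte, which must be reassembled after both accesses return.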
FIG. 1C is a schematic diagram of a conventional circuit 300 for reassembling misaligned data returned from a cache access. The circuit 300 is typically referred to as a load formatter. The formatter 300 includes formatter control logic 302, which provides the control signals required to operate the other components of the circuit 300. Also included in the formatter 300 are a rotator 304, a merge latch 306 and a multiplexor 308. The rotator 304 receives data from the cache and, depending on the signals received from the formatter control logic 302, arranges the data into eight byte blocks which can be shifted to any desired eight bit (that is, byte) location in the rotator 304. In the present case, the bytes a, b and c are rotated to the leftmost position of the rotator 304 and then passed to the merge latch 306, which holds the data while the processor makes a second access to line 1 of the cache. When the processor accesses cache line 1, it retrieves byte d and passes it to the rotator 304, which rotates it to the fourth byte position from the left. Afterwards, byte d is passed directly to the multiplexor 308 along with bytes a, b and c from the merge latch 306. In this way, the data is correctly reassembled and then passed to the architected registers on the processor.
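The rotate, latch and merge sequence described above can be modeled with a short Python sketch. It is a behavioral illustration under stated assumptions, not a description of the actual circuit: the rotation amounts stand in for the signals from the control logic, and the function names and the contents of the two double words are hypothetical.

```python
def rotate_left(block: bytes, n: int) -> bytes:
    """Rotate an eight byte block left by n byte positions (models the rotator)."""
    n %= len(block)
    return block[n:] + block[:n]


def mux(latched: bytes, direct: bytes, take_from_latch: int, width: int = 4) -> bytes:
    """Select the leading bytes from the merge latch and the rest from the rotator."""
    return latched[:take_from_latch] + direct[take_from_latch:width]


# Hypothetical cache returns for the FIG. 1B case.
first_access = b"xxxxxabc"    # last double word of cache line 0
second_access = b"dxxxxxxx"   # first double word of cache line 1

# First access: rotate a, b and c to the leftmost position and hold them.
merge_latch = rotate_left(first_access, 5)

# Second access: rotate d to the fourth byte position from the left.
rotated_second = rotate_left(second_access, 5)

# The multiplexor combines the latched bytes with the directly passed byte.
word = mux(merge_latch, rotated_second, take_from_latch=3)
```

In this model `merge_latch` holds `abcxxxxx`, `rotated_second` is `xxxdxxxx`, and the multiplexor output `word` is the reassembled four bytes `abcd`.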
Superscalar processors achieve performance advantages over conventional scalar processors because they allow instructions to execute out of program order. In this way, one slow executing instruction will not hold up subsequent instructions which could execute using other resources on the processor while the slower instruction is pending.
However, misaligned accesses to the cache memory do not lend themselves to superscalar processing because of the possibility that the data may return from the cache out of order. Specifically, referring again to the example above, if for some reason the second load instruction completed before the first load instruction, then the data containing byte d would enter the formatter first, followed by the data containing bytes a, b and c. In this case, when the data is reassembled, the order of the bytes would be incorrect. One solution to this problem is to prohibit misaligned cache access instructions from speculatively executing. In other words, when the superscalar processor recognizes that a misaligned access to the cache is about to occur, it ceases issue of instructions subsequent to the misaligned cache access instruction, and stalls while it waits for the instructions issued prior to the cache access instruction to complete. Then, it processes the two cache access instructions in order. In this way, the misaligned cache access is guaranteed to complete in order. Although this solves the above mentioned problem, it also reduces the processor's performance. It is thus one object of the invention to provide a superscalar processor which allows speculative execution of misaligned cache access instructions. Further objects and advantages of the present invention will become apparent in view of the following disclosure.
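The ordering hazard described above can be illustrated with a deliberately simplified Python model. It assumes that the formatter places data purely in the order in which it arrives; the function name is hypothetical, and the sketch is not intended to represent any particular hardware implementation.

```python
def reassemble_in_arrival_order(arrivals):
    """Concatenate the returned pieces in the order they reach the formatter."""
    return b"".join(arrivals)


first_piece = b"abc"   # returned by the access to cache line 0
second_piece = b"d"    # returned by the access to cache line 1

# The two accesses complete in program order.
in_order = reassemble_in_arrival_order([first_piece, second_piece])

# The second access completes before the first.
out_of_order = reassemble_in_arrival_order([second_piece, first_piece])
```

In this model `in_order` is `abcd`, whereas `out_of_order` is `dabc`, which shows why a formatter that relies on arrival order requires the two accesses to complete in program order.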