The present invention relates generally to the field of microprocessor microarchitecture.
Before the invention of caches, several machines implemented forms of dynamic scheduling in order to avoid stalling when a memory access was encountered. The two most notable examples were the CDC 6600 with its scoreboard and the IBM 360/91 with its Tomasulo algorithm, both introduced in the late 1960s. Dynamic scheduling, which entails rearranging the order of instructions in hardware during execution of the program while maintaining the semantics of the original program order, was found to be extremely complex, expensive, hard to debug, and hard to test. Therefore, during the 1970s and 1980s, no other dynamically scheduled machines were produced at IBM. Similarly, dynamic scheduling was also abandoned at CDC. Furthermore, dynamically scheduled processors were not produced by other manufacturers during that period.
Shortly after the introduction of the CDC 6600 and the IBM 360/91, computer systems using cache memory were developed. In those systems, as in modern computers, most memory accesses by a processor are satisfied by data in cache memory. Since the cache can be accessed much more quickly than main memory, the need for dynamic scheduling was also reduced.
In recent years, processor cycle times have decreased greatly, and the capacity of memory chips has increased significantly. But the access time of memory chips has changed little. This has led to an increasing gap between cache access times and main memory access times.
For example, in the late 1970s, a VAX-11/780 would slow down by only 50% if its cache were turned off and it executed out of main memory. Today, main memory access times can be more than 100 cycles, and programs could slow down by more than 100 times if they fetched each instruction and data reference from main memory instead of cache. Even when only an occasional instruction or data reference is accessed from main memory, the small number of cache misses can still greatly slow down program execution because of the long memory access times.
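The effect of occasional misses can be illustrated with a short calculation. The following is a minimal sketch, assuming a single-level cache, a 1-cycle cache hit, and a 100-cycle main memory access; the function name and the chosen miss rates are hypothetical and serve only as illustration.

```python
def average_access_cycles(hit_cycles: float, miss_penalty_cycles: float,
                          miss_rate: float) -> float:
    """Average cycles per memory access for a single-level cache.

    Every access pays the hit time; a fraction `miss_rate` of accesses
    additionally pays the main memory penalty.
    """
    return hit_cycles + miss_rate * miss_penalty_cycles


# Assumed figures: 1-cycle cache hit, 100-cycle main memory access.
for miss_rate in (0.0, 0.01, 0.05, 1.0):
    cycles = average_access_cycles(1, 100, miss_rate)
    print(f"miss rate {miss_rate:5.2f}: {cycles:6.1f} cycles per access")
```

Under these assumed figures, a miss rate of only 1% already doubles the average access time relative to a cache that always hits, and a miss rate of 100% corresponds to the case of executing entirely out of main memory.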
In order to reduce processor stalling when a cache miss is encountered, some microprocessor manufacturers have reintroduced dynamic scheduling in their processors in recent years. A dynamically scheduled processor will try to find other instructions that do not depend on the data being fetched by the missing load, and execute these other instructions out-of-order and in parallel with the cache miss. Significantly higher performance can thus be obtained.
Dynamically scheduled microarchitectures, like the earlier dynamically scheduled systems, are complex and have a large transistor count, a long design time, and long verification cycles. Therefore, there exists a need for a microarchitecture that reduces processor stalling when a cache miss is encountered without resorting to a high-complexity design.
An embodiment of the present invention is a processor that does not require the complexity of a dynamically scheduled microarchitecture, but is capable of achieving improved performance relative to a conventional statically-scheduled processor. The processor does not stall upon a data cache miss. Rather, the processor continues execution in a special Speculative Prefetching After data cache Miss (SPAM) mode when a cache miss is encountered. In the SPAM mode, the processor prefetches data and instructions not yet present in cache. When the initiating data cache miss is filled, the processor resumes execution in a normal mode. Some of the instructions that launched prefetches during SPAM mode may be executed again in normal mode. In this way, the processor can avoid or reduce stalling caused by data cache misses.
An embodiment that is described and shown includes normal mode registers for use during normal mode execution, and SPAM registers for use during SPAM execution. The processor may further include two program counters, one for use during normal mode execution and another for use during SPAM execution. The processor may also include a SPAM cache for holding data during SPAM execution.
During normal mode execution, register writes occur to the normal mode registers. When a data miss occurs, in one embodiment, the normal mode program counter (PC) and the normal mode registers are copied to the SPAM program counter (SPAM PC) and the SPAM registers. Execution of the program then continues using the SPAM PC until the fetch for the data cache miss returns. The normal PC and the normal mode registers remain unchanged throughout SPAM execution. When the fetch for the data cache miss returns, normal mode execution using the normal mode registers and normal mode PC resumes.
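The sequence just described can be sketched as a small software model. This is an illustrative sketch only, not the hardware of any embodiment. The two-instruction format, the MISS_LATENCY value, the marking of unavailable SPAM values as None, and the treatment of a prefetched line as present in the cache as soon as the prefetch is launched are all assumptions made for the example.

```python
from dataclasses import dataclass, field

MISS_LATENCY = 4  # assumed: instructions that can run ahead during one miss


@dataclass
class Processor:
    program: list   # ("load", dest, base_reg, offset) or ("addi", dest, src, imm)
    memory: dict
    cache: set = field(default_factory=set)    # addresses held in the data cache
    regs: dict = field(default_factory=dict)   # normal mode registers
    pc: int = 0                                # normal mode PC
    misses: int = 0                            # misses seen in normal mode
    prefetched: list = field(default_factory=list)

    def run(self):
        while self.pc < len(self.program):
            kind, dest, src, imm = self.program[self.pc]
            if kind == "load":
                addr = self.regs.get(src, 0) + imm
                if addr not in self.cache:
                    self.misses += 1
                    self.spam()              # run ahead while the miss is outstanding
                    self.cache.add(addr)     # the initiating miss is filled
                self.regs[dest] = self.memory.get(addr, 0)
            else:
                self.regs[dest] = self.regs.get(src, 0) + imm
            self.pc += 1

    def spam(self):
        # Copy normal state to SPAM state; normal PC and registers stay untouched.
        spam_regs = dict(self.regs)
        spam_regs[self.program[self.pc][1]] = None   # missing load's result is unknown
        spam_pc = self.pc + 1
        for _ in range(MISS_LATENCY):
            if spam_pc >= len(self.program):
                break
            kind, dest, src, imm = self.program[spam_pc]
            base = spam_regs.get(src, 0)
            if base is None:
                spam_regs[dest] = None               # depends on unavailable data
            elif kind == "load":
                addr = base + imm
                if addr in self.cache:
                    spam_regs[dest] = self.memory.get(addr, 0)
                else:
                    self.cache.add(addr)             # launch a prefetch (assumed instant)
                    self.prefetched.append(addr)
                    spam_regs[dest] = None
            else:
                spam_regs[dest] = base + imm
            spam_pc += 1


program = [
    ("load", "r1", "r0", 100),  # misses and initiates SPAM mode
    ("addi", "r2", "r1", 1),    # depends on the missing load
    ("load", "r3", "r0", 200),  # independent, prefetched during SPAM mode
    ("load", "r4", "r0", 300),  # independent, prefetched during SPAM mode
]
p = Processor(program, memory={100: 7, 200: 8, 300: 9})
p.run()
print(p.misses, p.prefetched, p.regs)
```

In this model the second and third loads are executed twice, once in SPAM mode to launch their prefetches and once in normal mode to obtain their results, and only the first load misses in normal mode.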
According to an embodiment, a register file contains pairs of normal mode registers and SPAM registers laid out adjacent to each other. The normal mode registers are used during normal mode operations and the SPAM registers are used during SPAM mode operations. Special circuits of the processor copy the contents of the normal mode registers into the corresponding SPAM registers on an initiating cache miss. Another embodiment that is described and shown includes normal mode registers and SPAM registers that are held in separate register files. In this embodiment, the normal mode registers are not copied to the SPAM registers immediately after a data cache miss. Rather, the SPAM registers are updated on an instruction-by-instruction basis.
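The difference between the two register organizations can be sketched as follows. The first class models the bulk copy on an initiating miss as described above. The second class is only one hypothetical way to realize an instruction-by-instruction update: the per-register valid bit and the fallback to the normal mode register are assumptions made for this sketch and are not taken from the description. All class and method names are likewise hypothetical.

```python
class PairedRegisterFile:
    """Paired layout: each normal mode register has an adjacent SPAM register."""

    def __init__(self, n):
        self.normal = [0] * n
        self.spam = [0] * n

    def enter_spam(self):
        # Bulk copy of every normal mode register on the initiating miss.
        self.spam = list(self.normal)

    def spam_read(self, i):
        return self.spam[i]

    def spam_write(self, i, value):
        self.spam[i] = value


class SeparateRegisterFiles:
    """Separate files, using an ASSUMED valid-bit scheme (not from the source)."""

    def __init__(self, n):
        self.normal = [0] * n
        self.spam = [0] * n
        self.spam_valid = [False] * n

    def enter_spam(self):
        # No register data is copied; only the assumed valid bits are cleared.
        self.spam_valid = [False] * len(self.normal)

    def spam_read(self, i):
        # Fall back to the normal mode register until SPAM mode has written i.
        return self.spam[i] if self.spam_valid[i] else self.normal[i]

    def spam_write(self, i, value):
        # The SPAM register is updated one instruction at a time.
        self.spam[i] = value
        self.spam_valid[i] = True


for rf in (PairedRegisterFile(4), SeparateRegisterFiles(4)):
    rf.normal[1] = 5
    rf.enter_spam()
    before = rf.spam_read(1)
    rf.spam_write(1, 9)
    print(type(rf).__name__, before, rf.spam_read(1), rf.normal[1])
```

In both sketches, writes made in SPAM mode never alter the normal mode registers, so normal mode execution can resume with its state intact once the initiating miss is filled.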