Pointer de-referencing has become prevalent in today's programming languages, such as C++ and other object-oriented languages. The clear trend is to produce code in which structures are created dynamically. This has created the problem of how to efficiently handle pointer de-reference operations at the microarchitectural level of a computer system. Prior approaches to solving this problem have attempted to enhance performance by focusing on improved data prefetching schemes.
Data prefetching is a subject that has been extensively explored recently as a way to improve processor performance. The basic idea behind prefetching is to load referenced data from external memory into an on-chip cache so that the memory latency is hidden. When referenced data is available in a local cache memory of the processor, program execution proceeds smoothly and rapidly. However, if data is not resident in the on-chip data cache, the processor must perform a bus cycle to access memory. This means that all of the dependent operations usually must have their execution postponed until the required data is returned from memory. Hence, prefetching is aimed at bringing data into a cache local to the processor before the data is actually needed.
Both hardware and software-based data prefetching schemes have been tried or proposed for reducing the stall time in a processor caused by memory latency. For example, an article entitled, "Effective Hardware-Based Data Prefetching for High-Performance Processors," by Tien-Fu Chen, et al. (IEEE 1995) describes a hardware-based prefetching mechanism that tracks data access patterns in a reference prediction table. Utilizing the history of previous code, the table is constructed based on addresses generated in prior iterations of an instruction pointer. Keeping track of data access patterns in this manner permits the address of the prefetch request to be calculated based upon the recorded history.
FIG. 1 illustrates a prior art approach of a reference prediction table 10 organized as an instruction cache for tracking previous reference addresses and associated strides for load and store instructions. In the computer arts, a stride is defined as the difference between the addresses of the two most recent accesses made by the same instruction. Reference prediction table 10 records the effective address of the operand, computes the stride for an access, and sets a state controlling the prefetching by comparing the previously recorded stride with the one most recently computed. Thus, the predicted prefetch address is based upon the past history for the same memory access instruction.
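The per-instruction bookkeeping described above can be illustrated with a minimal software sketch. It is a simplification under stated assumptions, not the mechanism of the cited article: the names `RptEntry`, `ReferencePredictionTable`, and `access` are hypothetical, the table is an unbounded dictionary rather than a fixed-size cache, and the controlling state is reduced to a single flag that is set when the newly computed stride matches the previously recorded one.

```python
from dataclasses import dataclass


@dataclass
class RptEntry:
    """One table entry, keyed by the address of a load or store instruction."""
    prev_addr: int        # effective address of the most recent access
    stride: int = 0       # difference between the two most recent addresses
    steady: bool = False  # True when the last two computed strides matched


class ReferencePredictionTable:
    """Toy stride predictor (hypothetical names; unbounded, no replacement)."""

    def __init__(self):
        self.entries = {}

    def access(self, pc, addr):
        """Record one access by instruction `pc`.

        Returns the predicted prefetch address, or None when the entry is
        new or the recorded stride and the newly computed stride disagree.
        """
        entry = self.entries.get(pc)
        if entry is None:
            # First time this instruction is seen: nothing to predict yet.
            self.entries[pc] = RptEntry(prev_addr=addr)
            return None
        new_stride = addr - entry.prev_addr
        # Compare the previously recorded stride with the new one.
        entry.steady = (new_stride == entry.stride)
        entry.stride = new_stride
        entry.prev_addr = addr
        return addr + entry.stride if entry.steady else None
```

For a load that walks memory in steps of 8 bytes, this sketch starts predicting on the third access, once the same stride has been observed twice; a single out-of-pattern address suppresses the prediction again.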
The authors of the above paper report improved performance in the case of a constant or local stride, in situations where the stride is small, and also for scalar and zero-stride memory access patterns. Unfortunately, when the memory access pattern is irregular, the mechanism illustrated in FIG. 1 produces erroneous prefetches. This is a serious problem since irregular memory access patterns appear frequently in certain types of code (e.g., pointer de-referencing). In other words, for code that exhibits irregular memory access patterns the above described hardware-based prefetching scheme is useless.
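The contrast between regular and irregular access patterns can be made concrete with a short sketch. The function name `stride_prediction_hits` and both address sequences are assumptions chosen for illustration: the first sequence stands in for a walk over an array of 8-byte elements, and the second for the scattered node addresses that a linked-list traversal might touch. Neither is measured data.

```python
def stride_prediction_hits(addresses):
    """Count how often 'last address + last stride' equals the next address."""
    hits = 0
    for i in range(2, len(addresses)):
        stride = addresses[i - 1] - addresses[i - 2]
        if addresses[i - 1] + stride == addresses[i]:
            hits += 1
    return hits


# Hypothetical array walk: constant stride of 8 bytes.
array_walk = [0x1000 + 8 * i for i in range(8)]

# Hypothetical linked-list walk: nodes scattered across the heap.
list_walk = [0x1000, 0x5F40, 0x2A18, 0x9C00, 0x3310, 0x7788, 0x1450, 0x8E20]

print(stride_prediction_hits(array_walk))  # 6 of 6 predictions correct
print(stride_prediction_hits(list_walk))   # 0 of 6 predictions correct
```

With these inputs, every stride-based prediction for the array walk is correct, while every prediction for the scattered sequence names an address that is never accessed next, which is the erroneous-prefetch behavior described above.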
Another data prefetching mechanism that relies upon recurrence-based classification of memory access patterns is described in a paper entitled, "Quantifying the Performance Potential of the Data Prefetch Mechanism for Pointer-Intensive and Numeric Programs," by Sharad Mehrotra, et al. (dated Nov. 7, 1995). This paper describes the design of a prefetching mechanism which utilizes an indirect reference buffer (IRB) organized as two mutually cooperating sub-units: a recurrence recognition unit (RRU) and a prefetch unit (PU). In operation, the PU generates a prefetch using an indirect address stride computed after signaling by the RRU.
The problem with the foregoing IRB design, however, is that when a current operand access (e.g., a load) experiences a cache miss, the PU must wait idly until data is returned to the processor before it can generate the prefetch. This is because the contents of the current load target register are not yet available to compose the prefetch address.
What is needed is a new type of data prefetching mechanism that offers an alternative to ordinary stride-based prefetching and sequential prefetching policies. As will be seen, the present invention introduces the novel concept of "global stride" prefetching that is advantageous for prefetching targets of memory de-reference operations (like those that typically occur in linked lists and other types of irregular code). This new hardware data prefetching policy reduces cache miss penalty and improves effective memory access speed.