The present invention generally relates to computer processor operations and architectures. More particularly, the present invention relates to performance optimization by speculatively pre-fetching and pre-flushing data in a processor system in which instructions may be executed out of order.
A high performance processor, e.g., a super-scalar processor in which two or more scalar operations are performed in parallel, may be designed to execute instructions out of order, i.e., in an order that is different from what is defined by the program running on the processor. That is, in this high performance processor system, instructions are executed when they can be executed rather than when they appear in the sequence defined by the program. Typically, after the out of order execution of instructions, the results are ultimately reordered to correspond with the proper instruction order, prior to passing the results back to the program running on the processor.
Examples of processor architectures that execute instructions out of order are described in U.S. Pat. No. 5,758,178 (issued May 26, 1998, and entitled "Miss Tracking System and Method"), U.S. Pat. No. 5,761,713 (issued Jun. 2, 1998, and entitled "Address Aggregation System and Method for Increasing Throughput to a Multi-Banked Data Cache From a Processor by Concurrently Forwarding an Address to Each Bank"), U.S. Pat. No. 5,838,942 (issued Nov. 17, 1998, and entitled "Panic Trap System and Method"), U.S. Pat. No. 5,809,275 (issued Sep. 15, 1998, and entitled "Store-to Load Hazard Resolution System and Method for a Processor that Executes Instructions Out of Order"), and U.S. Pat. No. 5,799,167 (issued Aug. 25, 1998, and entitled "Instruction Nullification System and Method for a Processor that Executes Instructions Out of Order"), all to Gregg Lesartre, who is one of the present inventors, all assigned to the present assignee, and all of which are expressly incorporated herein by reference in their entireties.
As described in more detail in, e.g., U.S. Pat. No. 5,758,178 ('178), an out of order execution processor system may include one or more processors, each having a memory queue (MQUEUE) for receiving and executing instructions that are directed to memory accesses to the cache memory (DCACHE) or the memory hierarchy. The MQUEUE includes a plurality of instruction processing mechanisms for receiving and executing respective memory instructions out of order. Each instruction processing mechanism includes an instruction register for storing an instruction and an address reorder buffer slot (ARBSLOT) for storing the data address of the instruction execution results. Significantly, dependent-on-miss (DM) indicator logic in each ARBSLOT prevents a request from its respective ARBSLOT to the memory hierarchy for miss data that is absent from the DCACHE when another ARBSLOT has already requested the miss data from the memory hierarchy.
In particular, for example, FIG. 1 shows a block diagram of the relevant portions of the computer system for illustrating the operation of the instruction processing mechanism 39b portion of the MQUEUE. The MQUEUE includes one or more ARBSLOTs 48 (only one of which is shown). When an ARBSLOT 48 requests a cache line from the DCACHE 24, the ARBSLOT 48 asserts signal ACCESS_REQ 115 accompanied by an address ACCESS_ADDR 114. In the event that there is a potential hit in the DCACHE 24, the status indicator 82 (or status indicators if the cache is associative) will reflect a valid cache line or lines. Further, the tag compare mechanism 108 reads the tag DCACHE_TAG(s) 81 and compares it to the tag ACCESS_TAG 116 associated with the access address ACCESS_ADDR 114. When there is a match, the tag compare mechanism 108 concludes that there is a hit and deasserts the signal ~HIT 118 to indicate a hit, which causes the ARBSLOT 48 to mark itself done. The result of the operation is held in a rename register (not shown) until the instruction retires, when it is moved to an architectural register (not shown).
When the cache access results in a cache miss, e.g., based upon a status indicator 82 indicating an invalid cache line(s), or alternatively, when the tag DCACHE_TAG(s) 81 does not match the tag ACCESS_TAG 116, then the tag compare mechanism 108 asserts the ~HIT signal 118 to indicate a miss to the ARBSLOT 48. Assuming that this is the first ARBSLOT 48 to attempt to access this miss data line, the DM indicator logic 135 causes the miss request signal MISS_REQUEST 111 to be issued to the miss arbitrator 107. The miss arbitrator 107 arbitrates by prioritizing the various miss requests that can be generated by the various ARBSLOTs 48. Eventually, the miss arbitrator 107 issues a signal MISS_GRANTED 112 to grant the miss request. This signal is sent to the ARBSLOT 48, which in turn asserts the miss control signal MISS_CAV 101 to the system interface control 102. The system interface control 102 in turn makes a memory request to the memory hierarchy (not shown) for the data line based upon the address MISS/COPY_IN ADDR 104 that is forwarded from the ARBSLOT 48 to the system interface control 102.
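By way of illustration only, the hit and miss determination described above may be sketched in software as follows. The sketch assumes a direct-mapped cache, and the line size, the number of sets, and the names CacheLine, split_address and not_hit are hypothetical choices made for the example; none of them is taken from the referenced patents. The function returns the level of the active-low HIT signal 118, i.e., deasserted (False) on a hit and asserted (True) on a miss.

```python
from dataclasses import dataclass

LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 256    # assumed number of lines in a direct-mapped cache


@dataclass
class CacheLine:
    valid: bool = False   # stands in for status indicator 82
    tag: int = 0          # stands in for DCACHE_TAG 81


def split_address(addr):
    """Split an access address into (cache index, access tag)."""
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return index, tag


def not_hit(cache, access_addr):
    """Return the active-low HIT level: False on a hit, True on a miss."""
    index, access_tag = split_address(access_addr)
    line = cache[index]
    # A hit requires both a valid line and a matching tag.
    hit = line.valid and line.tag == access_tag
    return not hit


# Usage: install one line, then probe a hit, a tag mismatch and an invalid line.
cache = [CacheLine() for _ in range(NUM_SETS)]
addr = 0x12340
index, tag = split_address(addr)
cache[index] = CacheLine(valid=True, tag=tag)
print(not_hit(cache, addr))                          # hit
print(not_hit(cache, addr + LINE_BYTES * NUM_SETS))  # same index, other tag
print(not_hit(cache, addr + LINE_BYTES))             # neighbouring, invalid line
```

An associative cache would compare the access tag against every way of the selected set; that case is omitted here for brevity.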
Once the data line is transferred from the memory hierarchy to the system interface control 102, the system interface control 102 passes the data line to the DCACHE 24, as indicated by reference arrow 105, asserts the control signal COPY_IN to the DCACHE 24, and issues the status bits to the DCACHE 24. Simultaneously, the system interface control 102 asserts the control signal COPY_IN 103 to the ARBSLOTs 48 and places the associated address on MISS/COPY_IN ADDR 104 to the ARBSLOTs 48.
If another ARBSLOT 48 attempts to access the DCACHE 24 for a miss data line that is currently being requested from the memory hierarchy, then that ARBSLOT 48 will be advised by the status indicator 82, as the status indicator 82 will indicate a miss pending status, i.e., that the cache line is being requested by another ARBSLOT 48. Thus, a redundant memory request for a data line that has already been requested is avoided. A more detailed description of the memory queue (MQUEUE) and the DM indicator 135 may be found in the above listed U.S. patents, e.g., the '178 patent.
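The suppression of redundant miss requests described above may be illustrated by the following minimal sketch. It is a behavioral model only, under the assumption that a per-line status of invalid, valid or miss pending is sufficient to capture the behavior; the class names Status and MissTracker and the returned strings are hypothetical and do not appear in the referenced patents.

```python
from enum import Enum


class Status(Enum):
    INVALID = 0
    VALID = 1
    MISS_PENDING = 2


class MissTracker:
    """Behavioral model of per-line status and miss requests (hypothetical)."""

    def __init__(self):
        self.status = {}            # line address -> Status
        self.memory_requests = []   # line addresses sent to the memory hierarchy

    def access(self, line_addr):
        state = self.status.get(line_addr, Status.INVALID)
        if state is Status.VALID:
            return "hit"
        if state is Status.MISS_PENDING:
            # Another slot already requested this line: wait, do not re-request.
            return "dependent-on-miss"
        # First slot to miss on this line: issue the single memory request.
        self.status[line_addr] = Status.MISS_PENDING
        self.memory_requests.append(line_addr)
        return "miss-requested"

    def copy_in(self, line_addr):
        """The line has arrived from the memory hierarchy."""
        self.status[line_addr] = Status.VALID


tracker = MissTracker()
print(tracker.access(0x40))      # first slot misses and requests the line
print(tracker.access(0x40))      # second slot sees the pending miss
tracker.copy_in(0x40)
print(tracker.access(0x40))      # after copy-in the access hits
print(len(tracker.memory_requests))
```

In this model only one entry is ever appended to memory_requests for a given line, regardless of how many slots access the line while the miss is outstanding.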
While modern high performance processors, e.g., the super-scalar processor described above, have greatly improved instruction execution time, slow memory access time is still a significant impediment to a processor running at its full speed. If requests for data can be fulfilled from the cache memory, delays associated with an access to the slower memory hierarchy (usually referred to as cache miss latency) can be avoided. Thus, reducing the number of cache misses is a goal in high performance processor designs.
Moreover, in a multi-processor system, whenever a processor requests a data line, a coherency check is required to determine whether the respective caches of the other processors contain the requested data line, and/or whether a writing back (or flushing) of the data line to the memory hierarchy is required, e.g., when the data line was modified by the particular processor that owns the data line. The coherency check adds delays to memory accesses, referred to herein as coherency check latency.
Speculative pre-fetching and pre-flushing are based on a well known locality theory, called the spatial locality theory, which observes that when information is accessed by the processor, information at nearby addresses tends to be accessed as well. This is particularly true when the load or store operation that caused the cache miss is part of an instruction code sequence that accesses a record longer than a cache line, i.e., when the instruction code sequence references data that spans multiple data lines. In a system utilizing pre-fetching and/or pre-flushing, rather than fetching (and/or flushing) only the currently accessed data into (or from) the cache memory, a block of data (or one or more cache lines) in the vicinity, including the currently accessed data, may be brought into (and/or flushed from) the cache memory. This speculative pre-fetching and pre-flushing of extra data lines into (or from) the data cache before they are required by later memory reference instructions may hide at least some of the cache miss latency and the coherency check latency, and thus improve the overall performance of the processor system.
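As a simple illustration of bringing in a block of lines in the vicinity of the accessed data, the following sketch computes the addresses of all cache lines in the aligned block that contains a missing address. The line size, the number of lines per block, the choice of an aligned block (rather than, e.g., the next sequential lines), and the function name prefetch_lines are assumptions made for the example only.

```python
LINE_BYTES = 64        # assumed cache line size
LINES_PER_BLOCK = 4    # assumed number of lines fetched together


def prefetch_lines(miss_addr):
    """Return the line addresses of the aligned block containing miss_addr."""
    block_bytes = LINE_BYTES * LINES_PER_BLOCK
    # Round the address down to the start of its aligned block.
    block_base = (miss_addr // block_bytes) * block_bytes
    return [block_base + i * LINE_BYTES for i in range(LINES_PER_BLOCK)]


# A miss at 0x1050 falls in line 0x1040; the whole block 0x1000..0x10FF is fetched.
print([hex(a) for a in prefetch_lines(0x1050)])
```

Under these assumptions, a later access to a record that spills from line 0x1040 into line 0x1080 would find the second line already requested, which is the latency-hiding effect described above.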
Unfortunately, no known solution exists for implementing pre-fetching and/or pre-flushing of data lines in processors that execute instructions out of order. In a system employing the speculative pre-fetching and/or pre-flushing described above, each additional memory request resulting from out of order execution of instructions involves a memory transaction that transfers a number of data lines (rather than the single data line transferred without pre-fetching or pre-flushing of extra data lines). Such requests may therefore increase traffic across the system bus even further, may exacerbate excessive utilization of the system interface bandwidth, and thus may compromise system performance.
Thus, what is needed is an efficient system for and method of pre-fetching one or more data lines from a memory hierarchy to a cache memory without compromising the system performance of an out of order processing system.
What is also needed is an efficient system for and method of pre-fetching one or more data lines from a memory hierarchy to a cache memory while minimizing redundant memory requests in the event of a cache miss in an out of order processing system.
What is also needed is an efficient system for and method of pre-flushing one or more data lines from a cache memory in a system having multiple out of order instruction execution processors, without adding to the system complexity, thereby minimizing coherency check latency of the system.
In accordance with the principles of the present invention, an apparatus for minimizing cache coherency check latency in an out of order instruction execution system having a plurality of processors comprises at least one cache coherency check mechanism associated with a first one of the plurality of processors, the at least one cache coherency check mechanism being configured to output a presence signal indicating that a first data line being requested by a second one of the plurality of processors is present in a cache memory associated with the first one of the plurality of processors; at least one pre-flush slot configured to, upon receipt of the presence signal, determine at least one additional data line to be pre-flushed from the cache memory associated with the first one of the plurality of processors to the memory hierarchy; and a logic associated with the at least one pre-flush slot, the logic configured to provide an indication whether the at least one additional data line is already being flushed to the memory hierarchy from the cache memory.
In addition, in accordance with another aspect of the principles of the present invention, a method of minimizing cache coherency check latency in an out of order instruction execution system having a plurality of processors comprises detecting a request for access to a first data line from a memory hierarchy, the request being made by a first one of the plurality of processors, determining whether the first data line is present in a cache memory associated with a second one of the plurality of processors, calculating an address of at least one additional data line to be pre-flushed from the cache memory to the memory hierarchy, and determining whether a previously made request for the at least one additional data line from the cache memory is pending.
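The steps of the method summarized above may be illustrated, in a non-limiting manner, by the following behavioral sketch. It assumes that the additional data lines are the next sequential lines after the requested line and that a line is pre-flushed only if it is present in the cache; both assumptions, as well as the names PreFlushSlot and on_remote_request and the line size, are hypothetical and introduced solely for the example.

```python
LINE_BYTES = 64   # assumed cache line size


class PreFlushSlot:
    """Behavioral model of a pre-flush slot for one processor's cache."""

    def __init__(self, cache_lines):
        self.cache_lines = set(cache_lines)   # line addresses held in the cache
        self.pending = set()                  # lines already being flushed

    def on_remote_request(self, line_addr, extra_lines=1):
        """Handle another processor's request; return newly issued pre-flushes."""
        # Coherency check: is the requested line present in this cache?
        if line_addr not in self.cache_lines:
            return []
        issued = []
        for i in range(1, extra_lines + 1):
            # Assumed policy: candidates are the next sequential lines.
            candidate = line_addr + i * LINE_BYTES
            # Skip lines not held here and lines whose flush is already pending.
            if candidate in self.cache_lines and candidate not in self.pending:
                self.pending.add(candidate)
                issued.append(candidate)
        return issued


slot = PreFlushSlot({0x1000, 0x1040, 0x1080})
print([hex(a) for a in slot.on_remote_request(0x1000, extra_lines=2)])
print(slot.on_remote_request(0x1040))   # 0x1080 is already pending
print(slot.on_remote_request(0x2000))   # line not present in this cache
```

The second call issues nothing because the flush of line 0x1080 is already pending, which corresponds to the step of determining whether a previously made request for the additional data line is pending.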