1. Field of the Invention
The present application relates generally to an improved data processing apparatus and method and more specifically to an apparatus and method for implementing access speculation predictors that perform predictions based on a domain indicator of a cache line storing data for a memory region targeted by a data request.
2. Background of the Invention
Processors in a multi-processor computer system typically share system memory, which may be either in multiple private memories associated with specific processors, or in a centralized memory, in which memory access is the same for all processors. For example, FIG. 1 illustrates a multi-processor computer 100 utilizing a centralized memory system sometimes referred to as a “dance hall,” in which processors 102 are on one “side” of a data bus 116 and system memories 114 are on the other “side” of the data bus 114. When a processor, such as processor 102a requires data from memory, it first checks its own L2 cache 106a, which is inclusive of the L1 cache 104a with regard to such snoops. If the data is not in either local cache, then a request is put out onto data bus 116, which is managed by bus arbiter 110. Cache controllers 108 “snoop” data bus 116 for requests for data that may be in their respective caches 106 or 104.
If no valid data is in any of the caches, then the data is retrieved from one of system memories 114, each being assigned a particular range of memory addresses, which are under the control of respective memory controllers 112. If speculation is not performed, then before a specific memory controller 112 accesses data from its respective system memory 114, the memory controller 112 waits until a combined response is returned to the data bus 116 by the bus arbiter 110 stating that the source of the valid data is the system memory.
Referring now to FIG. 2, a time line 200 illustrates the sequence of events in which a data request from a cache is performed. At time (1), the bus arbiter 110, in response to a query from one of the processors 102 (shown in FIG. 1), puts a data request on the data bus. At time (2), each cache controller 108 provides a “snoop” response, such as “retry,” “busy,” “valid data available,” etc. The bus arbiter “collects” the snoop responses, and at time (3) issues an “early combined response,” which is a hint (guess) as to where the valid data is stored. That is, the bus arbiter 110 puts out an early response predicting which cache, if any, has the valid coherent data. At time (4), the bus arbiter 110 issues a “combined response,” which is a final response back to the bus confirming which cache controller 108, if any, has control and access to the requested data (or else that the request will be retried due to a bus collision or other delay).
As systems become more complex, as in more processors 102 (each with a dedicated cache controller 108) being connected to the data bus 116, the delay between the data request and the final combined response becomes much longer in a non-linear manner. That is, adding twice as many processors results in a time delay that is more than twice as long between the initial data request and the final combined response. This is due in part to the super-linear amount of time required for all cache controllers 108 to snoop and respond to the data request, and for the bus arbiter 116 to evaluate all of the cache controller responses and formulate the final combined response for broadcast back to the data bus 116.
In the event that none of the cache memories 106 or 108 have the requested valid data, then the data must be retrieved from one of the system memories 114. In an effort to minimize total time delay required to retrieve the data from a system memory 114 after a cache “miss,” memory controllers 112 also “snoop” data requests on the data bus 116, and speculatively fetches data from their respective system memory 114 whenever the data request is for data at a memory address used by that system memory 114. That is, if a data request on data bus 116 is for data at an address used by system memory 114a, then memory controller 112a automatically speculatively pre-fetches the data at that address and stores the data in a queue in the memory controller 112a. This brute approach is highly inefficient, since many of the data requests are for data stored in cache memories, and thus an access to system memory is not needed. Automatically accessing the system memories 114 in this manner not only ties up valuable queue resources in the memory controller 112, but also delays necessary accesses to memory (cache misses), consumes excessive power, which also results in the generation of excessive heat and wastes valuable power, including battery power.