Embodiments of the present invention relate to operation of a processor, and more particularly to obtaining data for use in a processor.
When data needed for a processor operation is not present in the processor, a latency occurs, namely the time it takes to load the data into the processor. Such a latency may be low or high, depending on where within the various levels of a memory hierarchy the data is obtained. Accordingly, prefetching schemes are used to generate and transmit prefetch requests corresponding to data or instructions that are predicted to be needed by a processor in the near future. When the prediction is correct and the data is readily available to an execution unit, latencies are reduced and performance is increased. Prefetching schemes typically predict the data locations to be accessed from the locations of current read requests.
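By way of illustration only, prediction from the locations of current read requests can be sketched as a simple stride detector. This is a minimal sketch under stated assumptions, not a description of any particular processor: the names `StridePrefetcher`, `observe` and `run_trace` are hypothetical, and the rule of predicting only after the same stride is seen twice is one common choice among many.

```python
from typing import List, Optional


class StridePrefetcher:
    """Toy predictor (hypothetical): guesses the next address from recent reads."""

    def __init__(self) -> None:
        self.last_addr: Optional[int] = None    # address of the previous read
        self.last_stride: Optional[int] = None  # distance between the two previous reads

    def observe(self, addr: int) -> Optional[int]:
        """Record a read request; return a predicted prefetch address, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Predict only when the same non-zero stride repeats (assumed policy).
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


def run_trace(addresses: List[int]) -> List[Optional[int]]:
    """Feed read addresses in order; collect the prefetch address issued at each step."""
    prefetcher = StridePrefetcher()
    return [prefetcher.observe(a) for a in addresses]
```

For a trace of reads at addresses 0, 64, 128 and 192, the sketch issues no prefetch for the first two reads and then predicts 192 and 256; an irregular trace such as 0, 64, 200 yields no prefetch at all.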
In addition to the latency incurred in requesting data from a remote location (e.g., memory, mass storage or the like), in many systems a processor socket may have its own latency associated with accessing data from within or outside the processor socket. These delays, which apply both to actual read requests and to prefetch requests generated in the processor socket, can be associated with routing and coherency determinations. For example, in systems implementing a point-to-point (PTP) interconnect system, a coherency protocol may be established such that a processor socket first determines whether a request (i.e., actual or prefetch) corresponds to a coherent memory location prior to sending the request from the processor socket. Such determinations can consume a significant number of cycles before a request is even sent out of the processor socket. For example, it may take 100 or more cycles before routing and coherency determinations are made and a request is ready to be transmitted from a processor socket. Such delays negatively affect performance.
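The effect of the in-socket delay on total request latency can be illustrated with simple arithmetic. The following is a rough sketch under stated assumptions: the 100-cycle figure is the lower bound given above, while the function names and any remote latency value passed in are hypothetical and chosen only for illustration.

```python
# Lower bound from the text ("100 or more cycles") for routing and
# coherency determinations inside the processor socket.
SOCKET_DELAY_CYCLES = 100


def request_latency(remote_latency_cycles: int,
                    socket_delay_cycles: int = SOCKET_DELAY_CYCLES) -> int:
    """Total cycles for a request: in-socket delay plus the remote access (assumed additive)."""
    return socket_delay_cycles + remote_latency_cycles


def socket_share(remote_latency_cycles: int,
                 socket_delay_cycles: int = SOCKET_DELAY_CYCLES) -> float:
    """Fraction of the total latency spent before the request leaves the socket."""
    total = request_latency(remote_latency_cycles, socket_delay_cycles)
    return socket_delay_cycles / total
```

With a hypothetical remote latency of 200 cycles, `request_latency(200)` gives 300 cycles, of which one third is spent before the request leaves the processor socket.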