The present invention relates generally to data processing techniques and systems and, more particularly, to prefetching techniques and systems.
A shared-memory multiprocessor can be described as a set of processors, memory, cell controllers and an interconnect (the latter two elements sometimes being collectively referred to as a “chipset”). The processors are clustered in cells and a dedicated chip (the cell controller) serves as an interface between the cell's processors, on the one hand, and memory and the interconnection network to other cells, on the other hand. When an application running on a processor needs some data or instructions, e.g., because a load instruction missed in all caches local to the processor, a request for that data and/or those instructions is sent by the processor to the cell controller. The cell controller forwards the request to other cell controllers and/or to main memory to obtain the requested data or instructions. However, the delay associated with such external requests introduces undesirable latency and should be minimized.
One technique used to reduce the latency associated with cache misses is known as prefetching. Prefetching refers to requesting and storing data or instructions before an application requests them, so that the later request does not result in a cache miss. Since cache memory space is limited, it is not possible to prefetch all of the data or instructions which potentially may be requested by applications. Thus, prefetching involves speculation as to which data or instructions will be needed in the near future by an application.
Various prefetching mechanisms are currently available. For example, software prefetching involves advance data requests initiated by application code (i.e., the prefetching instructions are embedded in the application code) or by a helper thread, in either case running on a processor of the cell. A helper thread can eavesdrop on the stream of requests made by the application threads, speculate as to which data requests will be next, prefetch that data and store it in cache memory. Alternatively, the helper thread can execute those portions of the application code that compute target addresses such that the helper thread has advance knowledge of potentially needed data and can initiate a prefetch sequence. However, software prefetchers generally require sophisticated compiler technology, which is nonetheless frequently unable to compute target addresses sufficiently in advance of the application's need for data. Moreover, these software prefetchers require significant processor bandwidth, thereby reducing the amount of that valuable resource available to the primary software applications running on the system.
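By way of illustration only, the helper-thread approach described above can be sketched as a toy simulation. The sketch assumes a hypothetical function `count_demand_misses`, a fully associative cache with least-recently-used replacement, prefetches that complete instantly, and a helper that already knows the address the application will need `run_ahead` accesses later. None of these details is taken from any particular system.

```python
from collections import OrderedDict


def count_demand_misses(addresses, capacity, run_ahead):
    """Count application misses when a helper prefetches `run_ahead` accesses ahead.

    Hypothetical model: `addresses` is the application's demand stream, the
    cache is fully associative with LRU replacement, and prefetches complete
    instantly.
    """
    cache = OrderedDict()  # address -> True, ordered from least to most recent

    def touch(addr):
        # Insert or refresh a line, evicting the least recently used one if full.
        if addr in cache:
            cache.move_to_end(addr)
            return
        if len(cache) >= capacity:
            cache.popitem(last=False)
        cache[addr] = True

    misses = 0
    for i, addr in enumerate(addresses):
        # Helper thread: it has already computed the address needed
        # `run_ahead` accesses from now, so it prefetches that line.
        target = i + run_ahead
        if run_ahead > 0 and target < len(addresses):
            touch(addresses[target])
        # Application thread: the demand access misses if the line is absent.
        if addr not in cache:
            misses += 1
        touch(addr)
    return misses


# An irregular (pointer-chasing style) stream of eight distinct addresses.
stream = [10, 50, 20, 90, 30, 70, 40, 60]
no_prefetch = count_demand_misses(stream, capacity=8, run_ahead=0)
with_helper = count_demand_misses(stream, capacity=8, run_ahead=2)
tiny_cache = count_demand_misses(stream, capacity=2, run_ahead=2)
```

In this model, a run-ahead distance of two with an ample cache leaves only the first two accesses as misses, whereas with a two-line cache each prefetched line is evicted before the application reaches it, so every access still misses. The sketch does not model the processor bandwidth consumed by the helper, which is the cost noted above.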
Prefetching mechanisms can also be implemented in hardware. For example, a hardware prefetcher can be implemented on the processor running the primary software application. Such a prefetcher can observe the behavior of the application, e.g., the stream of data requests that the application makes, speculate as to the data that will be required to satisfy future requests and prefetch that data. Alternatively, hardware prefetchers can be implemented in the cell controllers and/or the interconnect. These hardware prefetching mechanisms are hampered by their relative inability to correctly predict the upcoming target addresses of the applications, as well as the correct timing for prefetching data from those target addresses.
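As a hedged illustration of how a hardware prefetcher might speculate from an observed request stream, the following sketch models a simple stride detector in software. The class `StridePrefetcher`, the function `prediction_accuracy`, and the rule of predicting only after the same nonzero stride is seen twice in a row are assumptions chosen for illustration and do not describe any specific hardware design.

```python
class StridePrefetcher:
    """Toy stride detector (illustrative assumption, not a real design)."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Record one demand address; return a predicted next address or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Speculate only when the same nonzero stride repeats back to back.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


def prediction_accuracy(addresses):
    """Return (correct, checked) over predictions that had a following access."""
    prefetcher = StridePrefetcher()
    correct = checked = 0
    pending = None  # prediction made for the access that comes next
    for addr in addresses:
        if pending is not None:
            checked += 1
            if pending == addr:
                correct += 1
        pending = prefetcher.observe(addr)
    return correct, checked


regular = prediction_accuracy([0, 64, 128, 192, 256])
irregular = prediction_accuracy([0, 64, 128, 1000, 1064, 1128, 40])
pointer_chasing = prediction_accuracy([10, 50, 20, 90, 30, 70, 40, 60])
```

Under these assumptions the detector predicts a constant-stride stream correctly once the stride is established, mispredicts whenever the stream changes direction, and issues no predictions at all for an irregular stream, which reflects the address-prediction difficulty described above. The sketch does not model prefetch timing.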
Accordingly, it would be desirable to provide systems and methods for flexibly prefetching data and instructions which would overcome these drawbacks and limitations.