Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data, such as linear interpolation, tessellation, rasterization, texture mapping, and depth testing. Traditionally, graphics processors used fixed-function computational units to process graphics data. More recently, however, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data.
The performance of programs running on graphics processors depends heavily on whether memory instructions hit or miss in the cache, because of the long latencies (e.g., up to 400 cycles) attributed to performing a memory access. Thus, the more memory instructions miss in the cache, the worse the performance. To mitigate this, compilers can insert prefetch instructions well before these memory instructions execute. A prefetch instruction retrieves data into the cache before the memory instruction needs it, thus increasing the chances of a cache hit. The problem is that inserting prefetch instructions without knowing exactly which instructions miss in the cache is sub-optimal.
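The trade-off described above can be illustrated with a toy cache model. The following is a minimal sketch under stated assumptions, not a model of any particular graphics processor: the 64-byte line size, the 4-line LRU cache, the stride-8 load trace, and every name in the code (`LruCache`, `run`, `profile_misses`, `blanket`, `targeted`) are hypothetical. The model also ignores latency and treats a prefetch as completing instantly, so it only counts hits, misses, and wasted prefetch instructions.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed for illustration)


class LruCache:
    """Tiny fully associative cache with LRU replacement (hypothetical model)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def touch(self, addr):
        """Bring the line holding addr into the cache; return True on a hit."""
        tag = addr // LINE_SIZE
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)          # mark as most recently used
        else:
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)   # evict least recently used
            self.lines[tag] = True
        return hit


def run(trace, num_lines=4):
    """Replay (op, addr) pairs; return (demand_misses, prefetches, useless_prefetches)."""
    cache = LruCache(num_lines)
    misses = prefetches = useless = 0
    for op, addr in trace:
        hit = cache.touch(addr)
        if op == "load":
            misses += not hit
        else:                                    # op == "prefetch"
            prefetches += 1
            useless += hit                       # line was already cached
    return misses, prefetches, useless


def profile_misses(loads, num_lines=4):
    """Stand-in for miss knowledge: the addresses whose loads miss without prefetching."""
    cache = LruCache(num_lines)
    return {addr for _, addr in loads if not cache.touch(addr)}


def blanket(loads):
    """Insert a prefetch before every load, with no knowledge of misses."""
    out = []
    for op, addr in loads:
        out.append(("prefetch", addr))
        out.append((op, addr))
    return out


def targeted(loads, missing_addrs):
    """Insert a prefetch only before loads known to miss."""
    out = []
    for op, addr in loads:
        if addr in missing_addrs:
            out.append(("prefetch", addr))
        out.append((op, addr))
    return out


# 32 loads of 8-byte elements: 256 bytes, i.e. 4 cache lines.
loads = [("load", i * 8) for i in range(32)]

print(run(loads))                                   # (4, 0, 0)
print(run(blanket(loads)))                          # (0, 32, 28)
print(run(targeted(loads, profile_misses(loads))))  # (0, 4, 0)
```

In this sketch both insertion strategies remove all four demand misses, but the blanket strategy issues 32 prefetches of which 28 touch lines that are already cached, while the targeted strategy issues only the 4 that are needed. A real prefetch must also be issued early enough to cover the access latency, which this model does not capture.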