Modern microprocessors include a cache memory. The cache memory, or cache, stores a subset of the data stored in other memories of the computer system. When the processor executes an instruction that references data, the processor first checks whether the data is present in the cache. If so, the instruction can be executed immediately. Otherwise, the instruction must wait while the data is fetched from the other memory into the cache. Fetching the data may take a relatively long time, in some cases an order of magnitude or more longer than the time the processor needs to execute the instruction that processes the data.
Many software programs executing on a microprocessor must manipulate a large linear chunk of data. For example, the linear chunk of data might be between 100 and 1,000,000 bytes. Examples of such programs are multimedia-related audio or video programs that process a large chunk of data, such as video data or wave file data. Typically, the large chunk of data is present in an external memory, such as system memory or a video frame buffer. In order for the processor to manipulate the data, it must be fetched from the external memory into the processor.
If a needed piece of data is not present in the cache, the disparity in data fetching and data processing time may create a situation where the processor is ready to execute another instruction to manipulate the data, but is stalled, i.e., sitting idle waiting for the data to be fetched into the processor. This is an inefficient use of the processor, and may result in reduced multimedia system performance, for example.
In addressing this problem, designers of modern microprocessors have recognized that the programmer often knows which data will be needed well before the instructions that actually process the data, such as arithmetic instructions, are executed. Consequently, modern microprocessors include in their instruction sets prefetch instructions that fetch a cache line of the data into a cache of the processor before the data is needed. A cache line is the smallest unit of data that can be transferred between the cache and other memories. An example of a modern microprocessor with a prefetch instruction is the Intel Pentium III® processor. The Pentium III includes a PREFETCH instruction in the Streaming SIMD Extensions (SSE) to its instruction set.
In many software applications, a programmer knows he will be manipulating a large linear chunk of data, i.e., many cache lines. Consequently, programmers insert prefetch instructions, such as the Pentium III PREFETCH, into their programs to prefetch a cache line. The programmer inserts the prefetch instructions multiple instructions ahead of the actual instructions that will perform the arithmetic or logical operations on the data in the cache line. Hence, a program may have many prefetch instructions sprinkled into it. These added prefetch instructions increase the size of the program code as well as the number of instructions that must be executed.
Furthermore, under the conventional method, not only does the programmer have to sprinkle prefetch instructions into the code, but he also has to try to place them in the code so as to optimize their execution. That is, the programmer has to try to determine the timing of the execution of the prefetch instructions so that the data is in the cache when it is needed. In particular, the programmer attempts to place the prefetch instructions in the code so they do not clobber one another. That is, in conventional processors if a prefetch instruction is currently executing and a subsequent prefetch instruction comes along, one of the prefetch instructions may be treated as a no-op instruction and not executed. This does not accomplish what the programmer wanted, and likely results in lower performance.
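The clobbering behavior described above can be illustrated with a toy model. The sketch below is not a description of any particular processor: it assumes, purely for illustration, that only one prefetch can be outstanding at a time and that a prefetch issued while an earlier one is still in flight is the one treated as a no-op. The function name `simulate_prefetches` and the latency and issue times are hypothetical.

```python
def simulate_prefetches(issue_times, latency):
    """Toy model: one outstanding prefetch at a time (an assumption).

    issue_times: core-clock times at which prefetch instructions are issued.
    latency:     core clocks a prefetch needs to complete.
    Returns (completed, dropped) lists of issue times. A prefetch issued
    while an earlier one is still in flight is dropped, i.e. treated as a
    no-op.
    """
    completed, dropped = [], []
    busy_until = None  # core-clock time at which the in-flight prefetch ends
    for t in sorted(issue_times):
        if busy_until is not None and t < busy_until:
            dropped.append(t)          # clobbered: behaves like a no-op
        else:
            completed.append(t)        # accepted: occupies the slot
            busy_until = t + latency
    return completed, dropped


# Illustrative numbers only: prefetches issued every 60 core clocks
# while each one takes 120 core clocks to complete.
completed, dropped = simulate_prefetches([0, 60, 120, 180], latency=120)
print("completed:", completed)
print("dropped:  ", dropped)
```

Under these assumptions every second prefetch is dropped, so half of the cache lines the programmer intended to prefetch are never requested ahead of time. Spacing the same prefetches 120 core clocks apart avoids the drops in this model.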
One problem a programmer faces when hand-placing prefetch instructions is the variability of core/bus clock ratio. In many modern microprocessors, the clock frequency of the processor bus that connects the processor to the rest of the system is not the same as the clock frequency at which the logic inside the processor operates, which is commonly referred to as the core clock frequency. The core/bus clock ratio is the ratio of the processor core clock frequency to the processor bus clock frequency.
The difference between core clock and processor bus clock frequency is attributable in part to the common practice of sorting processors, as they are produced, according to the core clock frequency that a given integrated circuit will reliably sustain. Hence, a given processor design may sort into batches of four different core clock frequencies, such as 800 MHz, 900 MHz, 1 GHz, and 1.2 GHz. However, all of these batches of processors must operate in motherboards that are designed to run at one or two fixed bus clock frequencies, such as 100 MHz or 133 MHz. Hence, in the example above, eight different core/bus clock ratios may occur. Consequently, there may be eight different numbers of core clocks required for a typical prefetch to complete.
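The eight combinations can be enumerated directly. The following is a minimal sketch that uses the example frequencies from the paragraph above; the function name `core_bus_ratios` is hypothetical, and exact fractions are used only so that the ratios can be compared without rounding.

```python
from fractions import Fraction

# Example frequencies in MHz, taken from the text above.
CORE_MHZ = [800, 900, 1000, 1200]
BUS_MHZ = [100, 133]


def core_bus_ratios(core_mhz, bus_mhz):
    """Return {(core, bus): core/bus ratio} for every pairing."""
    return {(c, b): Fraction(c, b) for c in core_mhz for b in bus_mhz}


ratios = core_bus_ratios(CORE_MHZ, BUS_MHZ)
for (core, bus), ratio in sorted(ratios.items()):
    print(f"core {core:>4} MHz / bus {bus} MHz -> ratio {float(ratio):.2f}")
```

With these frequencies all eight ratios are distinct, and the largest, 1.2 GHz on a 100 MHz bus, is 12.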
The fact that the number of core clocks required for a typical prefetch to complete varies over a range makes it very difficult for a programmer to hand-place conventional prefetch instructions effectively. This may be shown by the following example. Assume the highest core/bus clock ratio is 12, and assume a typical prefetch instruction takes about 10 bus clocks, or about 120 core clocks. Assume the programmer is programming a loop that processes a single cache line of data, and that the loop takes approximately 60 core clocks to execute and is not dependent upon bus activity other than the bus activity generated by the prefetch instruction.
In this case, the programmer may choose to execute a prefetch instruction every other iteration of the loop, i.e., every 120 core clocks, to accommodate the highest core/bus ratio. The programmer's choice may work well if the ratio is 12. However, if the user has a system in which the ratio is 6, a typical prefetch instruction only takes about 60 core clocks, which is only one iteration through the loop. In this scenario, a prefetch instruction will be active only half the time, which may result in stalls of the processor waiting for the data to be fetched into the cache.
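The arithmetic in this example can be checked with a short calculation. The sketch below uses only the numbers assumed in the example (10 bus clocks per prefetch, a 60 core clock loop, one prefetch every other iteration); the function names are hypothetical and the model ignores every effect other than prefetch latency.

```python
def prefetch_core_clocks(bus_clocks, ratio):
    """Core clocks a prefetch takes, given its bus clocks and the core/bus ratio."""
    return bus_clocks * ratio


def prefetch_active_fraction(bus_clocks, ratio, loop_core_clocks,
                             iterations_per_prefetch):
    """Fraction of time a prefetch is in flight when one prefetch is
    issued every `iterations_per_prefetch` loop iterations."""
    latency = prefetch_core_clocks(bus_clocks, ratio)
    interval = loop_core_clocks * iterations_per_prefetch
    return min(1.0, latency / interval)


# Numbers from the example: 10 bus clocks per prefetch, 60 core clocks
# per loop iteration, one prefetch every 2 iterations.
for ratio in (12, 6):
    latency = prefetch_core_clocks(10, ratio)
    active = prefetch_active_fraction(10, ratio, 60, 2)
    print(f"ratio {ratio:>2}: prefetch takes {latency} core clocks, "
          f"active {active:.0%} of the time")
```

At a ratio of 12 the prefetch occupies the full 120 core clock interval, while at a ratio of 6 it completes in 60 core clocks and nothing is being prefetched for the remaining half of the interval, which is the situation the paragraph above describes.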
Therefore, what is needed is a microprocessor that supports a prefetch instruction that facilitates efficient prefetching. What is also needed is for the prefetch instruction to efficiently fit into the Pentium III opcode space.