This invention relates in general to the field of prefetch instructions in microprocessors, and more particularly to a microprocessor having a prefetch instruction that prefetches a specified number of cache lines.
Modern microprocessors include a cache memory. The cache memory, or cache, stores a subset of data stored in other memories of the computer system. When the processor executes an instruction that references data, the processor first checks to see if the data is present in the cache. If so, the instruction can be executed immediately since the data is already present in the cache. Otherwise, the instruction must wait to be executed while the data is fetched from the other memory into the cache. The fetching of the data may take a relatively long time, in some cases an order of magnitude or more longer than the time needed for the processor to execute the instruction to process the data.
Many software programs executing on a microprocessor manipulate a large linear chunk of data. For example, the linear chunk of data might be between 100 and 1,000,000 bytes. Examples of such programs are multimedia-related audio or video programs that process a large chunk of data, such as video data or wave file data. Typically, the large chunk of data is present in an external memory, such as in system memory or a video frame buffer. In order for the processor to manipulate the data, it must be fetched from the external memory into the processor.
If a needed piece of data is not present in the cache, the disparity in data fetching and data processing time may create a situation where the processor is ready to execute another instruction to manipulate the data, but is stalled, i.e., sitting idle waiting for the data to be fetched into the processor. This is an inefficient use of the processor, and may result in reduced multimedia system performance, for example.
In addressing this problem, designers of modern microprocessors have recognized that the programmer often knows which data will be needed well before the instructions that actually process the data, such as arithmetic instructions, are executed. Consequently, modern microprocessors include in their instruction sets prefetch instructions that fetch a cache line of the data into a cache of the processor before the data is needed. A cache line is the smallest unit of data that can be transferred between the cache and other memories. An example of a modern microprocessor with a prefetch instruction is the Intel Pentium III® processor. The Pentium III includes a PREFETCH instruction in its Streaming SIMD Extensions (SSE) to its instruction set.
In many software applications, a programmer knows he will be manipulating a large linear chunk of data, i.e., many cache lines. Consequently, programmers insert prefetch instructions, such as the Pentium III PREFETCH, into their programs to prefetch a cache line. The programmer inserts the prefetch instructions multiple instructions ahead of the actual instructions that will perform the arithmetic or logical operations on the data in the cache line. Hence, a program may have many prefetch instructions sprinkled into it. These added prefetch instructions increase the size of the program code as well as the number of instructions that must be executed.
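The conventional approach described above can be sketched in C as follows. This is only an illustration under stated assumptions: it uses the GCC/Clang builtin `__builtin_prefetch` in place of a hand-written Pentium III PREFETCH instruction, and the 32-byte line size, the prefetch distance, and the function name `sum_with_prefetch` are hypothetical choices that do not come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES     32u  /* ASSUMPTION: cache line size in bytes */
#define PREFETCH_AHEAD 4u   /* ASSUMPTION: hand-tuned distance, in cache lines */

/* Sums a linear chunk of data, issuing one single-line prefetch per
   cache line, several lines ahead of the bytes currently being processed. */
uint32_t sum_with_prefetch(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        if (i % LINE_BYTES == 0) {
            size_t ahead = i + PREFETCH_AHEAD * LINE_BYTES;
            if (ahead < len)
                __builtin_prefetch(&data[ahead]); /* compiler-specific builtin */
        }
        sum += data[i]; /* the "actual" work on the data */
    }
    return sum;
}
```

Every cache line costs one extra prefetch instruction plus the bookkeeping around it, which is the code-size and instruction-count overhead the paragraph above refers to.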
Furthermore, under the conventional method, not only does the programmer have to sprinkle prefetch instructions into the code, but he also has to try to place them in the code so as to optimize their execution. That is, the programmer has to try to determine the timing of the execution of the prefetch instructions so that the data is in the cache when it is needed. In particular, the programmer attempts to place the prefetch instructions in the code so they do not clobber one another. That is, in conventional processors if a prefetch instruction is currently executing and a subsequent prefetch instruction comes along, one of the prefetch instructions may be treated as a no-op instruction and not executed. This does not accomplish what the programmer wanted, and likely results in lower performance.
One problem a programmer faces when hand-placing prefetch instructions is the variability of core/bus clock ratio. In many modern microprocessors, the clock frequency of the processor bus that connects the processor to the rest of the system is not the same as the clock frequency at which the logic inside the processor operates, which is commonly referred to as the core clock frequency. The core/bus clock ratio is the ratio of the processor core clock frequency to the processor bus clock frequency.
The difference in core clock and processor bus clock frequency is attributed in part to the fact that it is common to sort processors as they are produced according to the core clock frequency that a given integrated circuit will reliably sustain. Hence, it may be that a given processor design will sort into batches of four different core clock frequencies, such as 800 MHz, 900 MHz, 1 GHz, and 1.2 GHz. However, all of these batches of processors must operate in motherboards that are designed to run at one or two fixed bus clock frequencies, such as 100 MHz or 133 MHz. Hence, in the example above, eight different combinations of core/bus clock ratios may occur. Consequently, there may be eight different numbers of core clocks that are required for a typical prefetch to complete.
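The eight combinations in the example above can be enumerated with a short C sketch. The frequencies are the ones quoted in the example; the function and array names are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Core clock frequencies from the example, in MHz. */
static const double CORE_MHZ[] = { 800.0, 900.0, 1000.0, 1200.0 };
/* Bus clock frequencies from the example, in MHz. */
static const double BUS_MHZ[] = { 100.0, 133.0 };

#define NUM_CORE (sizeof CORE_MHZ / sizeof CORE_MHZ[0])
#define NUM_BUS  (sizeof BUS_MHZ / sizeof BUS_MHZ[0])

/* Core/bus clock ratio: core clock frequency divided by bus clock frequency. */
double core_bus_ratio(double core_mhz, double bus_mhz)
{
    assert(bus_mhz > 0.0);
    return core_mhz / bus_mhz;
}

/* Writes every core/bus ratio into out[], which must hold at least
   NUM_CORE * NUM_BUS entries, and returns the number written. */
size_t enumerate_ratios(double *out)
{
    size_t n = 0;
    for (size_t c = 0; c < NUM_CORE; c++)
        for (size_t b = 0; b < NUM_BUS; b++)
            out[n++] = core_bus_ratio(CORE_MHZ[c], BUS_MHZ[b]);
    return n;
}
```

With these values the ratios run from roughly 6 (800 MHz core on a 133 MHz bus) to 12 (1.2 GHz core on a 100 MHz bus).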
Because the number of core clocks required for a typical prefetch to complete varies over a range, it is very difficult for a programmer to effectively hand-place conventional prefetch instructions. This may be shown by the following example. Assume the highest core/bus clock ratio is 12, and assume a typical prefetch instruction takes about 10 bus clocks or about 120 core clocks. Assume the programmer is programming a loop that processes a single cache line of data, and the loop takes approximately 60 core clocks to execute and is not dependent upon bus activity other than the bus activity generated by the prefetch instruction.
In this case, the programmer may choose to execute a prefetch instruction every other iteration of the loop, i.e., every 120 core clocks, to accommodate the highest core/bus ratio. The programmer's choice may work well if the ratio is 12. However, if the user has a system in which the ratio is 6, a typical prefetch instruction only takes about 60 core clocks, which is only one iteration through the loop. In this scenario, a prefetch instruction will be active only half the time, which may result in stalls of the processor waiting for the data to be fetched into the cache.
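The arithmetic of this example can be checked with a small C sketch. The function names are hypothetical, and the model deliberately assumes, as the example does, that the only bus activity comes from the prefetch itself.

```c
#include <assert.h>

/* Core clocks one prefetch occupies: bus clocks per prefetch
   multiplied by the core/bus clock ratio. */
unsigned prefetch_core_clocks(unsigned bus_clocks, unsigned ratio)
{
    return bus_clocks * ratio;
}

/* Percentage of time a prefetch is in flight when one prefetch is
   issued every `interval` core clocks. Capped at 100. */
unsigned prefetch_busy_percent(unsigned bus_clocks, unsigned ratio,
                               unsigned interval)
{
    assert(interval > 0);
    unsigned busy = prefetch_core_clocks(bus_clocks, ratio);
    if (busy >= interval)
        return 100;
    return busy * 100 / interval;
}
```

For a 10-bus-clock prefetch issued every 120 core clocks, `prefetch_busy_percent(10, 12, 120)` gives 100, while `prefetch_busy_percent(10, 6, 120)` gives 50, matching the "active only half the time" case above.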
Therefore, what is needed is a microprocessor that supports a prefetch instruction that facilitates efficient prefetching. What is also needed is for the prefetch instruction to efficiently fit into the Pentium III opcode space.
The present invention provides a microprocessor that supports a prefetch instruction that allows a programmer to specify the number of cache lines to prefetch. Accordingly, in attainment of the aforementioned object, it is a feature of the present invention to provide a microprocessor that executes a prefetch instruction specifying a block of cache lines to be prefetched from a system memory into a cache of the microprocessor. The microprocessor includes a prefetch count register that stores a count of the cache lines remaining to be prefetched. The microprocessor also includes a general purpose register, coupled to the prefetch count register, that stores an initial value of the count. The initial value is loaded into the general purpose register by an instruction prior to the prefetch instruction. The microprocessor also includes control logic, coupled to the prefetch count register, that copies the initial value from the general purpose register to the prefetch count register in response to decoding the prefetch instruction.
In another aspect, it is a feature of the present invention to provide a microprocessor. The microprocessor includes an instruction decoder that decodes instructions in an instruction set. The instruction set includes at least a set of instructions defined by an Intel Pentium III processor. The instruction set also includes a repeat prefetch instruction. The repeat prefetch instruction includes a Pentium III PREFETCH instruction opcode, a Pentium III REP string instruction prefix preceding the opcode, and a count specifying a number of cache lines to be prefetched.
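As a rough illustration of the encoding described in this aspect, the following C sketch recognizes a REP prefix followed by the PREFETCH opcode in a byte stream. The byte values are the ones documented for the x86 instruction set (REP prefix `F3`, SSE PREFETCH opcode `0F 18`); the function name and the simplifications (no other prefixes, no inspection of the ModR/M byte that follows) are assumptions made for brevity, not a description of the actual decoder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REP_PREFIX      0xF3u /* x86 REP/REPE string instruction prefix */
#define TWO_BYTE_ESCAPE 0x0Fu /* first byte of a two-byte opcode */
#define PREFETCH_OPCODE 0x18u /* second byte of the SSE PREFETCH opcode */

/* Returns true if the bytes at `code` begin with REP followed by the
   PREFETCH opcode. SIMPLIFICATION: other prefixes and the ModR/M byte
   are not examined. */
bool is_repeat_prefetch(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    return len >= 3 &&
           code[0] == REP_PREFIX &&
           code[1] == TWO_BYTE_ESCAPE &&
           code[2] == PREFETCH_OPCODE;
}
```

A plain PREFETCH, without the prefix, starts with `0F 18` and is therefore not matched by this check.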
In another aspect, it is a feature of the present invention to provide a microprocessor in a system with a system memory. The microprocessor includes an instruction decoder that decodes a prefetch instruction specifying a count of cache lines to prefetch from the system memory and an address in the system memory of the cache lines. The microprocessor also includes an address register, coupled to the instruction decoder, that stores the address specified in the prefetch instruction. The microprocessor also includes a count register, coupled to the instruction decoder, that stores the count specified in the prefetch instruction. The microprocessor also includes control logic, coupled to the address register, that controls the microprocessor to prefetch the cache lines specified in the address register and the count register from the system memory into a cache memory of the microprocessor.
In another aspect, it is a feature of the present invention to provide a method of a microprocessor prefetching cache lines into its cache. The method includes detecting a repeat prefetch instruction specifying a count of cache lines for prefetching from a system memory address, copying the count from a general purpose register of the microprocessor to a prefetch count register, and storing the address in a prefetch address register. The method also includes prefetching a cache line specified by the prefetch address register into the cache, decrementing the prefetch count register, and incrementing the prefetch address register. The method also includes repeating the prefetching, decrementing, and incrementing steps until the prefetch count register reaches a zero value.
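The steps of this method can be modeled in software with a minimal C sketch. It is a behavioral illustration only, not the hardware implementation: the structure and function names are hypothetical, the 32-byte cache line size is an assumption, and so are the choices to advance the address by one cache line per step and to prefetch nothing when the initial count is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u /* ASSUMPTION: the text does not fix a line size */

/* Hypothetical model of the two registers named in the method. */
struct repeat_prefetch_state {
    uint32_t prefetch_count;   /* cache lines remaining to be prefetched */
    uint32_t prefetch_address; /* address of the next cache line to prefetch */
};

/* Models one repeat prefetch. `gpr_count` stands for the count held in a
   general purpose register. Each "prefetched" line address is recorded in
   lines_out[] (which must hold at least gpr_count entries) in place of a
   real bus transaction. Returns the number of lines prefetched. */
size_t run_repeat_prefetch(uint32_t gpr_count, uint32_t address,
                           uint32_t *lines_out, size_t capacity)
{
    struct repeat_prefetch_state st;
    size_t fetched = 0;

    assert(gpr_count <= capacity);

    st.prefetch_count = gpr_count;  /* copy the count into the count register */
    st.prefetch_address = address;  /* store the address in the address register */

    /* ASSUMPTION: a zero count prefetches nothing. */
    while (st.prefetch_count != 0) {
        lines_out[fetched++] = st.prefetch_address;  /* prefetch one line */
        st.prefetch_count--;                         /* decrement the count */
        st.prefetch_address += CACHE_LINE_BYTES;     /* advance to next line */
    }
    return fetched;
}
```

For example, a count of 3 starting at address 0x1000 records the line addresses 0x1000, 0x1020, and 0x1040 under the 32-byte assumption.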
One advantage of the present invention is that it is backward compatible with the existing x86 instruction set architecture. This is because the Pentium III does not generate an exception for a PREFETCH instruction preceded by a REP prefix. Therefore, software programs may be written that include the repeat prefetch instruction of the present invention to execute more efficiently on a microprocessor supporting the repeat prefetch instruction according to the present invention, and the program will also execute correctly on a Pentium III.
Another advantage is that the present invention preserves x86 opcode space by re-using the PREFETCH opcode in combination with the REP prefix to virtually create a new opcode. A further advantage is that the present invention potentially reduces software code size over conventional single-cache line prefetch instructions because fewer prefetch instructions need to be included in the program. A still further advantage is that the present invention potentially improves system performance by making more efficient use of the processor bus than the conventional method. A still further advantage is that the present invention potentially improves processing performance by moving data into the microprocessor cache more efficiently than the conventional method by alleviating the problems caused by the fact that a range of core clock to processor bus clock ratios may exist.
Other features and advantages of the present invention will become apparent upon study of the remaining portions of the specification and drawings.