1. Field of the Invention
This invention relates generally to computers, and more particularly, to a computer product, method, and apparatus for load operations.
2. Description of the Related Art
Modern computers contain microprocessors, which are essentially the brains of the computer. In operation, the computer uses the microprocessor to run computer programs.
A computer program might be written in a high-level computer language, such as C or C++, using statements similar to English, which are then translated (by another program called a compiler) into numerous machine language instructions. A program might also be written in assembly language, and then translated (by another program called an assembler) into machine language instructions. In practice, every computer language above assembly language is a high-level language.
Each computer program contains numerous instructions which tell the computer what it must do to achieve the desired goal of the program. The computer runs a particular computer program by executing the instructions contained in that program.
Modern computers also contain memory. The memory might be used to store computer program data, or it might be used to store computer program instructions. In general, every individual location in a computer memory has an address associated with it. The address might be a physical address or a virtual address. A physical address is one that corresponds to a fixed hardware memory location; a virtual address does not. Specifically, in microprocessors which support virtual addressing, computer programs reference virtual addresses, which are then mapped by memory management hardware onto physical addresses before the memory is actually read or written.
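By way of illustration only, the mapping of a virtual address onto a physical address might be sketched as follows. The single-level page table, the 4 KiB page size, and the names `PAGE_SIZE`, `translate`, and `example_table` are assumptions made for this sketch and are not drawn from any particular microprocessor.

```python
# Illustrative sketch only: a single-level page table with an assumed page size.
PAGE_SIZE = 4096  # hypothetical 4 KiB pages


def translate(virtual_address, page_table):
    """Map a virtual address to a physical address.

    page_table is a dict mapping virtual page number -> physical frame number.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # A real system would raise a page fault to the operating system here.
        raise KeyError(f"virtual page {page_number} is not mapped")
    return page_table[page_number] * PAGE_SIZE + offset


# Hypothetical mapping: virtual page 0 -> frame 5, virtual page 1 -> frame 2.
example_table = {0: 5, 1: 2}
physical_address = translate(4100, example_table)  # virtual page 1, offset 4
```

The offset within the page is carried over unchanged; only the page number is replaced by the frame number held in the table.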
A memory cache is a special sub-system in which frequently used data is stored for quick access, i.e. it stores the contents of frequently accessed memory locations and the addresses where those data items belong. When a microprocessor attempts to perform a load reference to an address in memory, the cache is checked to see whether it holds that address/data. If it does, the data is returned to the microprocessor from the cache and no reference is sent to memory. If it does not, a regular memory access occurs and the missing data is commonly copied from memory into the cache. When a microprocessor attempts to perform a store reference to an address in memory, again the cache is checked to see whether it holds that address. If it does, the cache will be updated with the store data. The store may also be sent to memory (write-through policy) or not (write-back policy). If the cache does not hold the store address (or the line in the cache is also contained within another device's cache, i.e. in a SHARED state), then the store may be sent directly to memory (write-through policy) or the missing data may be copied from memory into the cache and then updated (in the cache) with the store data (typical write-back policy). Accessing a memory cache is faster than accessing memory.
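The load and store behavior just described might be modeled, in a highly simplified form, as shown below. The class `SimpleCache` and its members are hypothetical names chosen for this sketch. The model assumes a single device (so the SHARED state does not arise), one data item per cache line, and no eviction; the `flush` method stands in for the eventual write of modified data back to memory under a write-back policy.

```python
class SimpleCache:
    """Toy cache in front of a dict-based memory (single device, no eviction)."""

    def __init__(self, memory, write_through=True):
        self.memory = memory        # backing store: address -> data
        self.lines = {}             # cached copies: address -> data
        self.dirty = set()          # addresses modified only in the cache
        self.write_through = write_through
        self.hits = 0
        self.misses = 0

    def load(self, address):
        if address in self.lines:
            self.hits += 1          # hit: no reference is sent to memory
            return self.lines[address]
        self.misses += 1            # miss: regular memory access, then fill
        data = self.memory[address]
        self.lines[address] = data
        return data

    def store(self, address, data):
        if self.write_through:
            self.memory[address] = data         # the store always reaches memory
            if address in self.lines:
                self.lines[address] = data      # keep the cached copy current
        else:
            # Write-back: bring the line into the cache if absent, then
            # update it in the cache only and remember that it is modified.
            self.lines.setdefault(address, self.memory.get(address))
            self.lines[address] = data
            self.dirty.add(address)

    def flush(self):
        """Write every modified line back to memory."""
        for address in self.dirty:
            self.memory[address] = self.lines[address]
        self.dirty.clear()
```

With `write_through=False`, a store leaves memory stale until `flush` is called, which is the essential difference between the two policies.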
RAM, or Random Access Memory, is a semiconductor-based memory that can be read and written by the microprocessor or other hardware devices. The storage locations can be accessed in any order. RAM is the type of memory frequently used as main memory on a personal computer.
Most modern microprocessors use a design technique called pipelining, where each operation is performed in a series of pipeline stages. In operation, a microprocessor fetches an instruction from memory and feeds it into one end of the pipeline. The pipeline is made up of several stages, each stage performing some function or process necessary or desirable to process the instruction before passing the instruction to the next stage. Thus the output of one stage serves as input to a second, the output of the second stage serves as input to the third, and so on. Therefore, in any clock cycle, more than one instruction may be in the process of execution (one per stage, or more than one per stage if the stages have multiple functional units).
Ideally, pipelining speeds execution time by ensuring that the microprocessor does not have to wait for instructions; when it completes execution of one instruction, the next is ready and waiting.
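The benefit of pipelining in the ideal case might be illustrated by a simple cycle count. The sketch below assumes one functional unit per stage, one clock cycle per stage, and no stalls; the function names are hypothetical.

```python
def unpipelined_cycles(instructions, stages):
    """Cycles needed when each instruction passes through all stages alone."""
    return instructions * stages


def pipelined_cycles(instructions, stages):
    """Cycles needed in an ideal pipeline that accepts one instruction per cycle.

    The first instruction needs `stages` cycles to complete; each later
    instruction completes one cycle after its predecessor.
    """
    if instructions == 0:
        return 0
    return stages + (instructions - 1)


# Example: 5 instructions through a 4-stage pipeline.
without_pipeline = unpipelined_cycles(5, 4)
with_pipeline = pipelined_cycles(5, 4)
```

Real pipelines fall short of this ideal whenever an instruction must wait for a result that is not yet available.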
In some advanced microprocessors, the pipeline is designed to support the processing of selected instructions speculatively. Speculative execution is a technique in which certain instructions are executed and results made available before they are determined to be needed by the program. Consequently, it also involves determining whether the need ever actually occurs, and if it does, making sure that the results of what was done ahead of time are still valid. Once all these questions about a speculatively executed instruction have been answered favorably, the instruction is said to be resolved, retired, or architecturally committed, and is no longer speculative.
One class of instructions frequently contained in a computer program is store instructions. Store instructions are assembly or machine level instructions that cause information to be written by the executing processor into a particular location (address) in memory.
Another class of instructions frequently contained in a computer program is load instructions. Load instructions are assembly or machine level instructions that cause data to be taken from a particular location (address) in memory, and placed into a specified register within the executing processor so that the data can be acted upon during execution of a subsequent instruction.
An important source of performance loss in modern microprocessors is waiting for data to be returned from long latency load operations. In the sequence of instructions contained in a computer program, a load instruction often closely precedes the instruction that acts upon the data loaded. Because such an instruction needs to wait for the load operation to complete before it can begin its execution, time spent waiting for completion of the load operation delays execution of the computer program.
One technique used to reduce this delay involves changing the sequence of instructions in the computer program so that the load occurs earlier than it would in the normal sequence of instructions. This change in sequence may be done by the compiler. Moving a load up-stream from its normal position in the sequence of instructions is sometimes called advancing the load or boosting the load. The basic idea is to start the load operation as early as possible, giving as much time as possible for the load operation to complete before any instructions dependent on the load are encountered in the sequence of instructions. Store instructions, however, limit how far ahead a load instruction may be advanced. This limit arises because the compiler often cannot determine whether a load instruction and a store instruction conflict, that is, whether they are reading from and writing to overlapping physical memory locations.
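The notion of a conflict between a load and a store might be expressed as a test for overlapping address ranges. The sketch below is illustrative only; the function name and the byte-size parameters are assumptions. It also presumes that both addresses are already known, which is precisely the information a compiler often lacks when it arranges the instructions.

```python
def accesses_overlap(load_address, load_size, store_address, store_size):
    """Return True if the byte ranges touched by a load and a store intersect.

    Each access covers the half-open range [address, address + size).
    """
    return (load_address < store_address + store_size
            and store_address < load_address + load_size)
```

For example, a 4-byte load at address 100 conflicts with a 4-byte store at address 102, but not with a 4-byte store at address 104.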
In the unoptimized sample code fragment,
add r1+r2→r3
store [r4], r5
sub r6−r7→r8
load [r9]→r10
and r10, r11→r12
the r1, r2, and so forth are registers. The brackets around r4 and r9 are used to denote that the contents of r4 and r9 are to be used as the addresses for the store and load operations. If the compiler cannot determine whether r4 and r9 are referring to overlapping physical memory locations, then r4 and r9 are referred to as being unresolved with respect to each other, or as undisambiguated memory addresses.
In this example, the load instruction (the next-to-last instruction) and the instruction that uses the data loaded (the last instruction, i.e. the “and” instruction) are separated by only one clock cycle. If the load instruction has a latency of over one clock cycle, the microprocessor will not have the data needed by the “and” instruction available in time and, consequently, will need to defer or stall execution of the “and” instruction and potentially all later instructions.
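The stall described above might be quantified with a small model. The sketch below assumes an in-order machine that issues at most one instruction per clock cycle, a latency of one cycle for every instruction other than a load, and no other hazards; the function `count_stall_cycles` and the tuple encoding of the fragment are hypothetical.

```python
def count_stall_cycles(program, load_latency):
    """Return the total stall cycles for an in-order, single-issue machine.

    program is a list of (opcode, source_registers, destination_register).
    Loads take load_latency cycles; all other instructions are assumed to take 1.
    """
    ready_at = {}   # register -> first cycle in which its value can be consumed
    cycle = 0       # cycle in which the next instruction issues if it need not wait
    stalls = 0
    for opcode, sources, destination in program:
        # An instruction cannot issue before all of its sources are ready.
        earliest = max([cycle] + [ready_at.get(reg, 0) for reg in sources])
        stalls += earliest - cycle
        latency = load_latency if opcode == "load" else 1
        if destination is not None:
            ready_at[destination] = earliest + latency
        cycle = earliest + 1
    return stalls


# The unoptimized fragment, encoded for the model.
unoptimized = [
    ("add",   ["r1", "r2"],   "r3"),
    ("store", ["r4", "r5"],   None),
    ("sub",   ["r6", "r7"],   "r8"),
    ("load",  ["r9"],         "r10"),
    ("and",   ["r10", "r11"], "r12"),
]

stalls_by_latency = {n: count_stall_cycles(unoptimized, n) for n in (1, 2, 3)}
```

Under these assumptions the fragment incurs no stall when the load latency is one cycle, and one additional stall cycle for every cycle of load latency beyond the first.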
Traditionally, a compiler will try to move the load instruction as far ahead as possible. In the optimized sample code fragment,
add r1+r2→r3
store [r4], r5
load [r9]→r10
sub r6−r7→r8
and r10, r11→r12
the load instruction has been boosted to just below the store instruction. The load instruction is two clock cycles away from the dependent use “and” instruction. But unless the compiler can determine that the address of the load, r9, and the address of the earlier store instruction, r4, refer to non-overlapping memory addresses, it is not safe to move the load instruction past the store instruction. Moving the load above the store would be unsafe because if the load operation and the store operation are to overlapping target addresses, the load operation needs to get the data from the store operation. This mandatory requirement would be violated if the load instruction ended up earlier in the instruction sequence than the store instruction. Consequently, boosting of load instructions has been limited by the presence of store instructions.
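The hazard may be demonstrated by executing only the store and the load in both orders. The sketch below is illustrative; the function `run`, the register contents, and the memory contents are hypothetical values chosen for the demonstration.

```python
def run(order, registers, memory):
    """Execute 'store [r4], r5' and 'load [r9] -> r10' in the given order.

    Returns the value that ends up in r10. Inputs are copied, not modified.
    """
    regs = dict(registers)
    mem = dict(memory)
    for op in order:
        if op == "store":
            mem[regs["r4"]] = regs["r5"]
        elif op == "load":
            regs["r10"] = mem[regs["r9"]]
    return regs["r10"]


program_order = ["store", "load"]   # the sequence the program specifies
hoisted_order = ["load", "store"]   # the load moved above the store

memory = {100: 11, 200: 22}
disjoint = {"r4": 100, "r5": 99, "r9": 200}      # r4 and r9 do not overlap
overlapping = {"r4": 100, "r5": 99, "r9": 100}   # r4 and r9 name the same location

safe = run(program_order, disjoint, memory) == run(hoisted_order, disjoint, memory)
unsafe = run(program_order, overlapping, memory) != run(hoisted_order, overlapping, memory)
```

When the addresses are disjoint, both orders leave the same value in r10. When they overlap, the hoisted load reads the old contents of memory (11) rather than the value written by the store (99).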
The present invention is directed to overcoming, or at least reducing the effects of, one or more of the problems mentioned above.