As processors become ever faster, increasingly the bottleneck restricting processing throughput is the speed—or lack thereof—of computer memory in responding to processor directives. This “memory latency” is a very serious problem, because processors process instructions and data much faster than these instructions and data can be retrieved from memory. Today, the speed with which microprocessors can process instructions commonly is rated in gigahertz. Unfortunately, overall system performance is hamstrung by motherboards operating between one hundred and three hundred megahertz, i.e., almost an order of magnitude slower.
To make matters worse, the disparity between the speed of processor clocks and memory clocks is growing. Currently, the ratio of processor clock speed to memory clock speed typically is 8:1, but that ratio is predicted to increase to 100:1 in the next few years. Compounding the problem is the fact that a memory system may require ten or more of its own memory clock cycles to respond to a memory retrieval request; thus, the ratio for a complete memory cycle is far worse. Today, completion of one full memory cycle may result in the waste of hundreds of processing cycles. In the near future, based on current performance trends in microprocessors, completion of a memory cycle may result in the waste of thousands of processing cycles.
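The arithmetic behind these figures can be sketched as follows. This is only an illustration: the function name is hypothetical, and the ten memory clock cycles per request is an assumed value taken as the lower bound of "ten or more." At that lower bound an 8:1 ratio wastes 80 processor cycles, so longer memory responses push the figure into the hundreds, and a 100:1 ratio reaches a thousand.

```python
def wasted_processor_cycles(clock_ratio: int, memory_cycles_per_request: int) -> int:
    """Processor cycles that elapse while one memory request completes.

    clock_ratio: processor clock cycles per memory clock cycle (8 for 8:1).
    memory_cycles_per_request: memory clock cycles needed to answer one request.
    """
    return clock_ratio * memory_cycles_per_request


# Assumed sample values: the ratios named in the text, ten memory cycles per request.
current = wasted_processor_cycles(clock_ratio=8, memory_cycles_per_request=10)
projected = wasted_processor_cycles(clock_ratio=100, memory_cycles_per_request=10)
print(current, projected)
```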
To reduce delays caused by memory latency, processors incorporate an execution pipeline. In the execution pipeline, a sequence of instructions to be executed is queued to avoid the interminable memory retrieval delays that would result if each instruction were retrieved from memory one at a time. However, if the wrong instructions and/or data have been loaded into the pipeline, the processor will fall idle while the wrong instructions are cleared and replaced with the correct instructions.
FIG. 1 is a flowchart illustrating these problems and some of the solutions. To expedite processing, once a program or routine is initiated, at 110 instructions are queued in the execution pipeline, and the processor begins to execute the queued instructions at 130. The processor continues executing instructions from the pipeline until one of two things happens. If the processor reaches the end of the queued instructions at 140, the processor will wait idle at 150 until the next instructions are queued, then resume executing queued instructions at 130. In this instance, memory pages storing the next instructions may be in the process of being opened to transfer their contents to the execution pipeline, so the memory latency delay may not be too lengthy.
If the processor has not reached the end of the instructions queued in the execution pipeline, delays still may result when conditional branch instructions are encountered. A typical CPU may sequentially load a range of instructions from memory in the order they appear, ignoring the possibility that a conditional branch instruction in that range could redirect processing to a different set of instructions. FIGS. 2A and 2B represent two situations in which instructions were loaded into the execution pipelines 210 and 220, respectively, on the assumption that the conditional branch would not be taken, so that the instructions following the conditional branch instruction were queued in the execution pipelines 210 and 220. In both FIGS. 2A and 2B, the conditional branch will be taken if “VARIABLE” is equal to “CONDITION.”
In the situation depicted in FIG. 2A, it is assumed that VARIABLE is not equal to CONDITION. Therefore, the conditional branch is not taken. As a result, the next instructions that should be processed are those immediately following the conditional branch instruction. Thus, as it turns out, queuing the instructions following the conditional branch was the correct course of action, and the processor can continue processing the next instructions in the execution pipeline without delay, as though the conditional branch instruction did not exist.
On the other hand, FIG. 2B depicts the situation in which VARIABLE is equal to CONDITION. As a result, the branch is taken rather than executing the next queued instructions as in the example shown in FIG. 2A. Because the execution pipeline had been loaded with instructions on the assumption that the conditional branch would not be followed, this is considered to be an unexpected branch 160 (FIG. 1). Because the condition is met and the branch must be taken, the instructions following the conditional branch, which were queued as they were in the execution pipeline 210 in FIG. 2A, will not be processed. Accordingly, the execution pipeline 220 must be cleared as shown in FIG. 2B, and the processor will fall idle while the execution pipeline is reloaded. Having to reload the execution pipeline 220 as shown in FIG. 2B is comparable to the situation in which the execution pipeline had not been loaded with any instructions beyond the conditional branch instruction. Thus, the entire queuing process begins anew at 110 (FIG. 1) with the processor waiting for a full memory retrieval cycle to get the next instruction, “INSTRUCTION AFTER BRANCH 1,” which eventually is loaded into the pipeline at 230.
The taking of an unexpected branch 160 may result in a significantly longer processor idle interval than the processor reaching the end of the queued instructions at 150. If the processor reaches the end of the queued instructions, the next needed instructions may be in the process of being fetched to the execution pipeline. If the instructions are in the process of being retrieved, only a few processor cycles might remain before the instructions reach the execution pipeline. However, if an unexpected branch is taken as at 160, the retrieval of the next instructions starts anew, and hundreds of processor cycles might pass before the next instructions reach the execution pipeline.
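The difference between these two idle intervals can be modeled with a short sketch. It is a toy model under stated assumptions, not a description of any actual processor: the cycle counts, the instruction names, and the function name are all hypothetical, chosen only to stand in for "a few" and "hundreds" of cycles.

```python
FULL_FETCH_CYCLES = 200  # assumed: retrieval starts anew ("hundreds" of cycles)
IN_FLIGHT_CYCLES = 5     # assumed: next instructions already being fetched ("a few" cycles)


def idle_cycles(queued, branch_taken):
    """Idle cycles incurred after draining a pipeline queued as if no branch is taken."""
    for instruction in queued:
        if instruction == "COND_BRANCH" and branch_taken:
            # Unexpected branch (160): the remaining queued instructions are
            # discarded and the pipeline is reloaded from memory.
            return FULL_FETCH_CYCLES
    # End of the queue reached (140/150): the next instructions are assumed
    # to be already on their way to the pipeline.
    return IN_FLIGHT_CYCLES


queue = ["LOAD", "ADD", "COND_BRANCH", "STORE", "SUB"]
print(idle_cycles(queue, branch_taken=False))
print(idle_cycles(queue, branch_taken=True))
```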
To avoid processing delays resulting from unexpected branching, techniques such as branch speculation and prediction have been devised. With reference to FIG. 1, speculation and/or prediction 180 occurs once a conditional branch instruction like “IF VARIABLE=CONDITION” has been encountered at 170. Using speculation or speculative branching, instructions queued in the pipeline are previewed, and if an instruction comprises a conditional branch, the system speculates as to the outcome of the branch condition, and loads in the execution pipeline instructions and data from the predicted branch. Speculation renders an educated guess by attempting to precalculate the key variable to project the likelihood the branch is taken, and instructions from the more or most likely branch are queued for processing.
If the correct educated guess is made, the effect is the same as if the instructions in sequence were loaded ignoring any possible branches, as shown in FIG. 2A, and the processor can continue processing without having to wait for new instructions to be retrieved. However, if the speculation incorrectly predicts the branch, incorrect and unusable instructions will have been loaded in the pipeline, and the effect is the same as illustrated in FIG. 2B. The processor will, therefore, fall idle while instructions in the pipeline are cleared and replaced with the instructions from the branch actually followed. In sum, speculation can avoid wasted processing cycles, but only if the speculation routine guesses correctly as to what branch will be followed.
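The dependence on guessing correctly can be expressed as an expected cost. The sketch below assumes a simple model in which every wrong guess costs one full pipeline reload and every correct guess costs nothing; the function name, the misprediction rate, and the 200-cycle penalty are illustrative assumptions only.

```python
def expected_idle_cycles(misprediction_rate: float, flush_penalty_cycles: int) -> float:
    """Average idle cycles per conditional branch under the assumed model.

    A correct guess costs nothing; a wrong guess costs one full pipeline reload.
    """
    if not 0.0 <= misprediction_rate <= 1.0:
        raise ValueError("misprediction_rate must be between 0 and 1")
    return misprediction_rate * flush_penalty_cycles


# Assumed values: one branch in four guessed wrong, 200 cycles per reload.
print(expected_idle_cycles(0.25, 200))
```

Under this model the benefit of speculation shrinks in direct proportion to how often the guess is wrong, which is the limitation the paragraph above describes.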
Prediction is a technique which exploits multiscalar or superscalar processors. A multiscalar processor includes multiple functional units which provide independent execution slots to simultaneously and independently process different, short word instructions. Using prediction, a multiscalar processor can simultaneously execute both eventualities of an IF-THEN-ELSE-type instruction, making the outcome of each available without having to wait the time required for the sequential execution of both eventualities. Based on the parallel processing of instructions, the execution pipeline can be kept filled for more than one branch possibility. “Very Long Instruction Word” processing methodologies, such as Explicitly Parallel Instruction Computing (“EPIC”) devised by Intel and Hewlett-Packard, are designed to take advantage of multiscalar processors in this manner. The EPIC methodology relies on the compiler to detect such potential parallelism and generate object code to exploit multiscalar processing.
FIG. 2C depicts a scenario in which a microprocessor with two functional units processes instructions in two execution slots in parallel. Upon encountering the same conditional branch instruction as seen in FIGS. 2A and 2B, the width of the execution pipeline 230 allows it to be partitioned into a first execution slot 240 and a second execution slot 250, each of which is loaded with instructions conditioned on one of the two possibilities. The first execution slot 240 is loaded with instructions responsive to the possibility that “VARIABLE” is not equal to “CONDITION” and the branch is not taken, and the second execution slot 250 is loaded with instructions responsive to the possibility that “VARIABLE=CONDITION” and the branch is taken. Both of these sets of instructions can be loaded and executed in parallel. As a result, no processing cycles are lost in having to reload the pipeline, whether or not the branch is taken.
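The idea of executing both eventualities and keeping only the applicable result can be sketched in software. This is an analogy only: Python runs the two paths one after the other, whereas the hardware described above runs them in parallel, and the function and argument names are hypothetical.

```python
def run_both_paths(variable, condition, not_taken_path, taken_path):
    """Evaluate both branch outcomes, then keep the one the condition selects."""
    # Both slots are evaluated before the condition is consulted, so neither
    # result has to be fetched or computed after the branch is resolved.
    slot_240_result = not_taken_path()  # VARIABLE is not equal to CONDITION
    slot_250_result = taken_path()      # VARIABLE = CONDITION
    return slot_250_result if variable == condition else slot_240_result


result = run_both_paths(
    variable=3,
    condition=3,
    not_taken_path=lambda: "instructions after branch",
    taken_path=lambda: "instructions at branch target",
)
print(result)
```

The cost of this approach is that the work done for the path not selected is discarded, which is why it depends on spare functional units being available.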
Prediction, too, has many limitations. Of course, if available processing parallelism is not detected, prediction simply will not be used. In addition, if the instructions are long word instructions such that a single instruction consumes all of the available functional units, there can be no parallel processing, and, thus, no prediction. Alternatively, because a string of conditional branches potentially can invoke many different possible branches, the possibility clearly remains that instructions might be loaded into the execution pipeline for an incorrect branch. In such a case, the result would be that as illustrated in FIG. 2B, where the pipeline must be emptied and reloaded while the processor falls idle.
In sum, the object of branch speculation and/or prediction is to avoid wasting processor cycles by filling the execution pipeline with the instructions that are most likely to be needed as a result of a conditional branch, or with parallel sets of instructions to allow for multiple conditional branch outcomes, respectively. However, even if speculation or prediction helps to fill the execution pipeline with the appropriate instructions, those instructions might invoke other branches, routine calls, or data references, which may not be resolved until the processor actually processes the instruction. This would result in memory latency delays even when branch speculation or prediction works as intended.
For example, referring to FIG. 2C, the empty lines in execution slot 250 represent the time lost as a result of the reference to “BRANCH” in the first execution slot. Although instructions can continue to be loaded into execution slot 240, the memory page where “BRANCH” is stored must be opened before the instructions at that address can be retrieved into the pipeline. Similarly, instruction 270 calls for data to be retrieved from memory and moved into a register. Empty spaces in the execution slot 250 represent the delay which results while the memory page where “dataref” is stored is opened. Once again, the processor would fall idle during the many cycles required to retrieve the referenced information from memory.
Cache memory may avoid some of these delays by reducing the time required to retrieve information from memory by transferring portions of the contents of memory into fast memory devices disposed on the microprocessor itself (level one cache) or directly coupled to the microprocessor (level two cache). Typically, the processor can retrieve data from level two cache in half the time it can retrieve data from main memory, and in some cases in one-third or even one-sixth of that time. When a processor calls for instructions or data from memory, other information stored nearby in memory also is transferred to cache memory because it is very common for a large percentage of the work done by a particular program or routine to be performed by programming loops manifested in localized groups of instructions.
However, the use of cache memory does not completely solve the memory latency problem. Unless the desired data happens to be present in cache, the presence of cache memory saves no time at all. Cache memory has only a small fraction of the capacity of main memory; therefore, it can store only a fraction of the data stored in main memory. Should the processor call for data beyond the limited range of data transferred to cache, the data will have to be retrieved from memory, again leaving the processor idle for tens or hundreds of cycles while the relevant memory pages are fetched.
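The limited benefit of cache can be illustrated with a simple weighted average of access times. The sketch assumes a single cache level and a fixed hit rate; the function name and the cycle counts are hypothetical, with the cache time set to half the main memory time as a sample value.

```python
def average_access_cycles(hit_rate: float, cache_cycles: float, memory_cycles: float) -> float:
    """Average cycles per access for one cache level in front of main memory."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Hits are served from cache; every miss pays the full main memory time.
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


# Assumed values: 100 cycles for main memory, 50 cycles for cache.
print(average_access_cycles(hit_rate=0.5, cache_cycles=50, memory_cycles=100))
print(average_access_cycles(hit_rate=0.0, cache_cycles=50, memory_cycles=100))
```

With a hit rate of zero the average equals the main memory time, which matches the observation that cache saves no time unless the desired data happens to be present in it.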
What is needed is a way to help expedite the fetching of memory pages from memory into the execution pipeline to avoid or reduce memory latency delays. It is to improving this process that the present invention is directed.