The function of a microprocessor is to execute programs. Programs comprise a group of instructions. In a conventional unpipelined processor, the instruction is first fetched. Next, the instruction is decoded. Its register or memory references are then determined, and the operands are fetched from those sources. When all operands are ready, the execution unit performs the desired operation on these operands. Finally, the result is stored in the specified location when the execution completes. Thereupon, execution of the next instruction begins.
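The unpipelined sequence described above can be sketched as a simple loop. This is an illustrative sketch only: the three-operand instruction tuples, the opcode names "add" and "sub", and the function name run_unpipelined are hypothetical and do not correspond to any real instruction set.

```python
# Toy, hypothetical instruction format: (opcode, dest, source_a, source_b).
def run_unpipelined(program, registers):
    """Run each instruction to completion before the next one begins."""
    pc = 0
    while pc < len(program):
        instruction = program[pc]              # fetch
        op, dest, src_a, src_b = instruction   # decode
        a = registers[src_a]                   # operand fetch
        b = registers[src_b]
        if op == "add":                        # execute
            result = a + b
        elif op == "sub":
            result = a - b
        else:
            raise ValueError(f"unknown opcode: {op}")
        registers[dest] = result               # store the result
        pc += 1                                # only now does the next instruction start
    return registers
```

Every step of one instruction finishes before any step of the next instruction starts, so most of the processor's resources sit idle at any given moment.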
Processor performance can be improved by reducing the time it takes to execute a program. One technique for increasing performance is known as "pipelining". Pipelining is a well-known mechanism for extracting useful concurrency from a program's execution. This is accomplished by taking advantage of the various stages instructions pass through. The key to pipelining is that at any given phase of an instruction's execution lifetime, the instruction is making use of only a few of the processor's many resources (e.g., instruction cache, fetching, decoding, register lookups, memory reads, execution unit operations, and destination writes). Moreover, each instruction tends to make use of these resources in the same sequence.
Hence, in a pipelined processor, each step of a pipeline completes a portion of the execution of an instruction. Each of the steps in the pipeline is called a pipe stage. The pipe stages are separated by clocked registers or latches. The steps required to execute an instruction are executed independently in different pipeline stages, provided that there is a dedicated part of the processor for each pipe stage. The result of each pipeline stage is communicated to the next pipeline stage via the register between the stages. It should be noted that although pipelining does not decrease the total amount of time required to execute an individual instruction, it does reduce the average number of cycles per instruction, and hence the time required to execute a program, by permitting the processor to handle more than one instruction at a time. Furthermore, superscalar processors issue multiple instructions at a time. In this manner, a processor with multiple execution units can execute multiple instructions concurrently. This type of superscalar processor performs concurrent execution of instructions in the same pipeline stage, as well as concurrent execution of instructions in different pipeline stages.
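The throughput benefit can be illustrated with a small calculation. The sketch below assumes an idealized scalar pipeline in which every stage takes exactly one clock cycle and there are no stalls, hazards, or branches; these assumptions, and the function names, are introduced here for illustration only.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through every stage before the next begins."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: the first instruction takes n_stages cycles to
    complete; each later instruction completes one cycle after its
    predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5
    print(unpipelined_cycles(n, k))  # 500
    print(pipelined_cycles(n, k))    # 104
```

Under these assumptions each instruction still spends five cycles in flight, but one instruction completes per cycle once the pipeline is full.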
It is natural to assume that as the pipeline is made deeper by implementing more stages, the processor's performance would be enhanced. This assumption holds true up to a certain point. For digital signal processors, such as those used in compact disc players, extremely deep (e.g., 100+ stages) pipelines work fine. However, for general purpose processors, deep pipelines suffer from branching effects.
When a branch in the computer program is encountered, it is first fetched and decoded. Then, its operands are determined and the branch is executed. Next, its result is written back. At that point, four clock cycles have already been expended on the branch's successor instruction because it is pipelined along, the same as any other following instruction. If the successor to the branch does not directly follow the branch in the program's code, all of the partly-completed work in the pipeline behind the branch is erroneous and must be discarded. In other words, because pipelined processors fetch the next instruction before they have completely executed the previous instruction, the next instruction fetch could have been to the wrong place if the previous instruction was a branch. The deeper the pipeline, the more partly-completed work must be discarded and restarted.
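The relationship between pipeline depth and discarded work can be sketched with a simple model. It assumes one instruction is fetched per cycle, each stage takes one cycle, and a branch's outcome becomes known only when the branch reaches a given stage; the function name and the stage numbering (starting at 1) are hypothetical.

```python
def wasted_cycles(resolve_stage):
    """Cycles of work behind a branch that must be discarded when the
    branch goes the other way.

    By the time the branch occupies stage `resolve_stage`, one younger
    instruction has entered the pipeline for every earlier stage the
    branch has already left.
    """
    if resolve_stage < 1:
        raise ValueError("stages are numbered from 1")
    return resolve_stage - 1


if __name__ == "__main__":
    # Five-step flow from the text, resolved at the final step.
    print(wasted_cycles(5))   # 4
    # A hypothetical deeper pipeline resolving at its 20th stage.
    print(wasted_cycles(20))  # 19
```

With a five-stage flow resolved at the last stage, the model gives the four cycles mentioned above; the loss grows linearly as the resolving stage moves deeper.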
One example where branches are encountered involves the use of string operations. Virtually all computer programs include strings. A string is a collection or set of characters stored contiguously in memory in a particular sequence. To move or otherwise manipulate a string, the processor iterates over a number of elements of the string. In other words, the processor "loops" through a number of iterations over the length of the string. When a program loops, there must be a mechanism to bring the execution path out of the loop at the proper time. One method for halting execution of the loop entails predicting, at the end of each loop iteration, whether the loop is to be repeated. However, if the prediction is wrong, it can take a long time to recover from the misprediction, especially for those processors having a deep pipeline. For long strings, the penalty associated with a branch misprediction is spread over many iterations and does not seriously impact performance. However, the strings found in most programs are of short or medium length. The performance degradation incurred with branch mispredictions in these cases can be quite severe.
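The effect of string length can be illustrated with a rough model. The sketch assumes the loop-closing branch is always predicted as "repeat", so it is mispredicted exactly once, when the loop exits; the cost of one cycle per element and the 15-cycle misprediction penalty are hypothetical figures chosen only for illustration, as are the function names.

```python
def copy_string(src):
    """Element-by-element copy; the loop test acts as one conditional
    branch per iteration."""
    dst = []
    i = 0
    while i < len(src):      # loop-closing branch
        dst.append(src[i])
        i += 1
    return "".join(dst)


def misprediction_overhead(length, cycles_per_element=1, penalty=15):
    """Fraction of total cycles lost to the single exit misprediction."""
    useful = length * cycles_per_element
    return penalty / (useful + penalty)


if __name__ == "__main__":
    print(copy_string("abcd"))
    print(misprediction_overhead(4))     # short string: most cycles are penalty
    print(misprediction_overhead(1000))  # long string: penalty is a small fraction
```

Under these assumptions the fixed exit penalty dominates a four-element string but amounts to a small fraction of the work for a thousand-element string.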
Thus, there is a need in the prior art for handling string operations in a processor having a pipelined architecture without relying on branch predictions.