1. Field of the Invention
This invention relates to the field of microprocessor architectures. More particularly, the invention relates to branch caching and pipeline control strategies to reduce branching delays in multi-issue processors, especially very long instruction word (VLIW) digital signal processors (DSPs).
2. Description of the Related Art
Most processors, such as microprocessors, media processors, Digital Signal Processors (DSPs), and microcontrollers, employ one or more pipelines to allow multiple instructions to execute concurrently. In a pipeline, processor instruction execution is broken down into a sequence of sub-instruction phases (also known as pipeline stages). The clock rate of the processor is usually determined by the timing of the slowest phase. The processor clock rate can be increased by breaking an instruction down into many short stages, each of which can be executed very quickly. The pipeline stages are typically buffered so that in an N-stage pipeline, N stages from N sequential instructions can execute concurrently. When operating at peak capacity, during each clock cycle the pipeline is able to start the first stage of a new instruction while completing the final stage of the oldest instruction in the pipeline. This provides an effective peak pipeline throughput of one instruction per clock.
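The timing relationship described above can be illustrated with a minimal sketch. It assumes an idealized, stall-free pipeline with one cycle per stage; the function names `pipeline_cycles` and `average_ipc` are hypothetical and are used here for illustration only.

```python
def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed to run a stall-free instruction sequence through one pipeline.

    Assumption: every stage takes exactly one clock and no hazards occur.
    """
    if num_instructions <= 0:
        return 0
    # The first instruction needs num_stages cycles to pass through the pipeline;
    # each later instruction completes one cycle after its predecessor.
    return num_stages + num_instructions - 1


def average_ipc(num_instructions: int, num_stages: int) -> float:
    """Average instructions completed per clock for the same idealized pipeline."""
    cycles = pipeline_cycles(num_instructions, num_stages)
    return num_instructions / cycles if cycles else 0.0
```

Under these assumptions, a long instruction sequence approaches the peak throughput of one instruction per clock, while a short sequence is dominated by the time needed to fill the pipeline.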
Multi-issue processors, such as those employing superscalar and VLIW architectures, can fetch multiple instructions per clock cycle and dispatch multiple instructions to multiple pipelines during each clock cycle. Thus, a processor with M pipelines can execute up to M instructions per clock. Use of many pipelines increases the number of instructions that can be executed per clock. Use of long pipelines, having shorter stages, allows faster clock rates. The fastest processors are therefore those that have many long pipelines.
While each pipeline can deliver a peak throughput of one instruction per clock, it is the average number of instructions per clock that determines the total processor throughput during actual program execution. Especially in real-time applications such as multimedia and digital signal processing, the throughput of the processor executing a specific application code determines the performance, cost, and operability of a system. Hence, it is important to consider program execution and its effect on pipeline operation.
Pipeline performance is limited by a number of conditions, called "hazards," that arise in program execution, as discussed in "Computer Architecture: A Quantitative Approach, 2nd Ed." by John Hennessy and David Patterson (Morgan Kaufmann Publishers, 1996). Three types of pipeline hazards exist: structural hazards; data dependency hazards; and control hazards. Hazards in the pipeline make it necessary to "stall" the pipeline. A pipeline stall occurs when the pipeline cannot accept a new instruction into the pipeline. A structural stall is said to occur if two different instructions at two different stages in the pipeline contend for the same hardware resource. A data dependency stall is said to occur if one instruction in the pipeline requires input data that is output from another instruction in the pipeline, and the output data is not yet ready. A control stall is said to occur if a branch, interrupt, or exception modifies the control flow of a program. A pipeline stall creates one or more bubbles, or empty slots in the pipeline. A control stall often causes many pipeline bubbles by causing the entire pipeline to be flushed. While structural and data dependency stalls can be dealt with according to prior art methods, control stalls remain more of a problem, especially in modern superscalar and VLIW systems with long pipelines.
While it is fairly easy to keep the pipeline full during sequential program operation, it becomes much more difficult to maintain pipeline throughput when a branch instruction changes the control flow in a program. This difficulty exists because branch instructions are typically not resolved until later stages in the pipeline, and while the branch instruction makes its way through the pipeline, the instructions that follow it in the pipeline may or may not be executed. When a branch is not taken, the next instruction executed after the branch is called the "fall-through" instruction and the address of this instruction is called the fall-through address. When a branch is taken, the next instruction executed after the branch is called the "branch target" (target) instruction and the address of this instruction is called the target address. Branches are problematic because, when the unresolved branch instruction enters the first stage of the pipeline, the prefetch unit does not have enough information to know whether the next address will be the fall-through address or the target address. Thus, the prefetch unit cannot know which instruction to fetch next. In many cases, the prefetch unit will fetch from the fall-through address (that is, it assumes the branch is not taken), and if the branch is taken, the processor will simply flush the pipeline and accept the time penalty. Since branch instructions typically account for approximately 20% of all instructions executed, this penalty can be severe.
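The cost of this penalty can be estimated with a short calculation. The sketch below assumes a simple model in which every taken branch flushes the pipeline and costs a fixed number of cycles. The function name `effective_cpi`, the taken fraction, and the flush penalty are hypothetical values chosen for illustration; only the 20% branch frequency comes from the discussion above.

```python
def effective_cpi(branch_fraction: float,
                  taken_fraction: float,
                  flush_penalty_cycles: float,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when taken branches flush the pipeline.

    Assumptions (not from the source): the prefetch unit always fetches the
    fall-through path, and each taken branch costs a fixed flush penalty.
    """
    stall_cycles_per_instruction = (
        branch_fraction * taken_fraction * flush_penalty_cycles
    )
    return base_cpi + stall_cycles_per_instruction


# 20% branches (from the text); 60% taken and a 5-cycle flush are assumed values.
cpi = effective_cpi(branch_fraction=0.20, taken_fraction=0.6, flush_penalty_cycles=5)
print(f"effective CPI: {cpi:.2f}")
```

With these assumed values the pipeline averages about 1.6 cycles per instruction instead of 1.0, and the loss grows in proportion to the pipeline depth that must be flushed.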
There are several prior art techniques that attempt to address the pipeline stall problem. A first method, as described in U.S. Pat. No. 4,200,927, appears to use a plurality of instruction prefetch buffers and speculatively decodes instructions from both the fall-through address and the target address. The speculatively decoded instructions are then sent to an instruction queue that feeds the execution unit. When the execution unit resolves the direction of the branch path, the instructions from the path not taken are flushed from the queue. This approach cannot be applied to modern pipelines that execute one instruction per clock cycle, because it relies on the execution unit being a microprogrammed state machine that requires multiple clock cycles to execute each instruction. The lag time provided by multi-cycle operation allows the prefetch unit and the instruction decoder ample time to concurrently process more than one instruction stream. Modern processors include multiple pipelined execution units that operate at substantially the same speed as the prefetch unit and decoder. Hence, this technique is not applicable to modern systems.
Another prior art technique is speculative execution. Speculative execution uses a branch cache, also called a branch target buffer, and two execution units. The branch target buffer holds the branch target address to be forwarded to the prefetch unit and also holds a sequence of target instructions. When a branch is encountered, the branch target address is obtained from the branch target buffer and a second instruction stream is fetched from the branch target address. A separate pipeline is provided to allow both the fall-through instruction stream and the target instruction stream to be processed concurrently. This technique has the advantage that the control stall is completely removed, regardless of whether the fall-through or target path is eventually selected. While this technique avoids the delay due to a stall, it requires considerable additional hardware, including a branch cache, control hardware, a second pipeline, and a second execution unit. This additional hardware may be prohibitively expensive, especially for superscalar and VLIW processors. Superscalar and VLIW processors employ M pipelines and M execution units, so that speculative execution requires a total of 2M pipelines and 2M execution units. In DSPs, some of these execution units are hardware multipliers that require a significant amount of chip area. Further, the speculative execution approach does not take advantage of any inefficiencies in instruction dispatch that may arise in multi-issue program execution due to data dependencies. Hence, the application of this technique is not practical since it would require a very large chip. Even when technology progresses to allow twice as much hardware to be integrated onto a single chip, that extra area would be put to better use by increasing the amount of on-board memory or by adding more execution pipelines.
Still another approach to dealing with control hazards is to use a branch prediction strategy. In branch prediction, a branch cache is used to monitor the most recently taken branches and to keep track of which way each branch has most often gone in the past. Based on past history, the most likely branch path is predicted and fetching begins from the predicted path. The branch cache will generally contain branch history information as well as the precomputed target address, and, in some cases, will contain one or more target instructions. This approach is more applicable to standard microprocessors and controllers, and is less applicable to VLIW processors. VLIW processors fetch very long instruction words (VLIWs) (also called fetch packets) which may contain many sub-instructions located in different fields of the VLIW. A group of sub-instruction fields issued to a set of pipelines simultaneously is known as an "execute packet." In some systems, the VLIW processor can take up to four pipeline stages just to bring the instruction into the prefetch buffer. If branch prediction is used in such a system, a correctly predicted branch will still cause a minimum of four cycles to be wasted. Further, if the prediction is incorrect and the stages are not buffered, then a branch stall occurs. Often the stall due to a misprediction is longer than a normal stall because a misprediction may invalidate various lines in the instruction cache and the data cache and thereby cause increased overhead due to cache misses. If the branches in the program are not predictable, then branch prediction may actually hamper performance due to cache miss overhead.
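The history-based prediction described above can be sketched as follows. The sketch assumes a two-bit saturating counter per branch, which is one common way of tracking which way a branch has most often gone; the prior art discussed here does not specify a particular history scheme. The names `BranchCache` and `count_mispredictions` are hypothetical.

```python
class BranchCache:
    """Illustrative branch cache: per-branch history plus a cached target address.

    Assumption: history is a 2-bit saturating counter (0..3); values of 2 or 3
    predict "taken". Real branch caches are finite and indexed by address bits.
    """

    def __init__(self):
        # branch address -> [history counter, precomputed target address]
        self._entries = {}

    def predict(self, branch_address):
        """Return (predicted_taken, target_address or None)."""
        entry = self._entries.get(branch_address)
        if entry is None:
            return False, None  # no history yet: assume fall-through
        counter, target = entry
        return counter >= 2, target

    def update(self, branch_address, taken, target_address):
        """Record the resolved outcome of a branch."""
        entry = self._entries.setdefault(branch_address, [1, target_address])
        if taken:
            entry[0] = min(3, entry[0] + 1)
            entry[1] = target_address
        else:
            entry[0] = max(0, entry[0] - 1)


def count_mispredictions(cache, branch_address, target_address, outcomes):
    """Replay resolved outcomes for one branch and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        predicted_taken, _ = cache.predict(branch_address)
        if predicted_taken != taken:
            wrong += 1
        cache.update(branch_address, taken, target_address)
    return wrong
```

For a loop-closing branch that is taken nine times and then falls through, this sketch mispredicts only the first and last outcomes. A branch whose outcomes alternate is mispredicted far more often, which is the unpredictable case discussed above.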
Branch prediction has other problems that limit its use in VLIW processors. VLIW processors execute looped code that is optimized using loop unrolling techniques whereby several loop iterations are unrolled into one macro-loop iteration. The branches in the looped code are highly predictable because the branch target instructions will be executed in all but the final iteration of the loop. This end condition is effectively dealt with by using a conditionally executed branch instruction. VLIW processors typically employ "delayed branch" instructions whereby instructions that fill the pipeline immediately after the branch are allowed to conditionally execute. The delay slots behind the delayed branch can be effectively put to use in predictable inner-loop processing by filling the delay slots with target instructions. This same delayed branch technique can be used to improve performance of unconditional branches, such as subroutine calls and returns, simply by inserting the branch instruction several cycles ahead of where it will actually be executed. However, delayed branch techniques do not work well on a VLIW when dealing with data-dependent conditional branches. Some data-dependent conditional branches can be avoided by using conditionally executed instructions, but this technique wastes hardware resources and thus reduces throughput.
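The loop unrolling mentioned above can be illustrated with a minimal sketch. It assumes an unroll factor of four and a simple accumulation loop; the function names are hypothetical, and Python is used only to show the structure of the transformation, not the code a VLIW compiler would emit.

```python
def sum_rolled(samples):
    """Baseline loop: one loop-closing branch per element."""
    total = 0
    for sample in samples:
        total += sample
    return total


def sum_unrolled_by_4(samples):
    """Same computation with four iterations folded into one macro-iteration."""
    total = 0
    count = len(samples)
    main_end = count - (count % 4)
    # Macro-loop: four elements per pass, so one loop-closing branch per four elements.
    for i in range(0, main_end, 4):
        total += samples[i] + samples[i + 1] + samples[i + 2] + samples[i + 3]
    # Remainder loop handles the leftover elements when count is not a multiple of 4.
    for i in range(main_end, count):
        total += samples[i]
    return total


def loop_closing_branches(num_iterations, unroll_factor):
    """Loop-closing branches executed by the macro-loop plus the remainder loop."""
    return num_iterations // unroll_factor + num_iterations % unroll_factor
```

Under these assumptions, a 1000-iteration loop unrolled by four executes 250 loop-closing branches rather than 1000, and each macro-iteration exposes four independent operations that can be packed into parallel sub-instruction fields.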