1. Field of the Invention
This invention relates to the field of microprocessor architectures. More particularly, the invention relates to branch caching and pipeline control strategies to reduce branching delays in multi-issue processors, especially very long instruction word (VLIW) digital signal processors (DSPs).
2. Description of the Related Art
Most processors, such as microprocessors, media processors, Digital Signal Processors (DSPs), and microcontrollers, employ one or more pipelines to allow multiple instructions to execute concurrently. In a pipeline, processor instruction execution is broken down into a sequence of sub-instruction phases (also known as pipeline stages). The clock rate of the processor is usually determined by the timing of the slowest phase. The processor clock rate can be increased by breaking an instruction down into many short stages, each of which can be executed very quickly. The pipeline stages are typically buffered so that in an N-stage pipeline, N stages from N sequential instructions can execute concurrently. When operating at peak capacity, during each clock cycle the pipeline is able to start the first stage of a new instruction while completing the final stage of the oldest instruction in the pipeline. This provides an effective peak pipeline throughput of one instruction per clock.
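The throughput claim above can be checked with a short calculation. The following is a minimal sketch, assuming an ideal pipeline with no stalls; the function names are illustrative and are not taken from the specification.

```python
def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles an ideal, stall-free N-stage pipeline needs to finish all instructions."""
    if num_instructions <= 0:
        return 0
    # The first instruction takes num_stages cycles to drain through the pipeline;
    # every later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)


def average_throughput(num_instructions: int, num_stages: int) -> float:
    """Average instructions completed per clock, including the pipeline fill time."""
    cycles = pipeline_cycles(num_instructions, num_stages)
    return num_instructions / cycles if cycles else 0.0


if __name__ == "__main__":
    # For long instruction sequences the average approaches the peak of one per clock.
    for count in (1, 10, 1000):
        print(count, pipeline_cycles(count, 5), round(average_throughput(count, 5), 3))
```

Under these assumptions the fill time of N - 1 cycles is amortized over the whole instruction sequence, which is why the peak rate is approached only when the pipeline stays full.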
Multi-issue processors, such as those employing superscalar and VLIW architectures, can fetch multiple instructions per clock cycle and dispatch multiple instructions to multiple pipelines during each clock cycle. Thus, a processor with M pipelines can execute M instructions per clock. Use of many pipelines increases the number of instructions that can be executed per clock. Use of long pipelines, having shorter stages, allows faster clock rates. The fastest processors are those processors that have many long pipelines.
While each pipeline can deliver a peak throughput of one instruction per clock, it is the average number of instructions per clock that determines the total processor throughput during actual program execution. Especially in real-time applications such as multimedia and digital signal processing, the throughput of the processor executing a specific application code determines the performance, cost, and operability of a system. Hence, it is important to consider program execution and its effect on pipeline operation.
Pipeline performance is limited by a number of conditions, called "hazards," that arise in program execution, as discussed in "Computer Architecture: A Quantitative Approach, 2nd Ed." by John Hennessy and David Patterson (Morgan Kaufmann Publishers, 1996). Three types of pipeline hazards exist: structural hazards; data dependency hazards; and control hazards. Hazards in the pipeline make it necessary to "stall" the pipeline. A pipeline stall occurs when the pipeline cannot accept a new instruction into the pipeline. A structural stall is said to occur if two different instructions at two different stages in the pipeline contend for the same hardware resource. A data dependency stall is said to occur if one instruction in the pipeline requires input data that is output from another instruction in the pipeline, and the output data is not yet ready. A control stall is said to occur if a branch, interrupt, or exception modifies the control flow of a program. A pipeline stall creates one or more bubbles, or empty slots in the pipeline. A control stall often causes many pipeline bubbles by causing the entire pipeline to be flushed. While structural and data dependency stalls can be dealt with according to prior art methods, control stalls remain more of a problem, especially in modern superscalar and VLIW systems with long pipelines.
While it is fairly easy to keep the pipeline full during sequential program operation, it becomes much more difficult to maintain pipeline throughput when a branch instruction changes the control flow in a program. This difficulty exists because the branch instructions are not typically resolved until later stages in the pipeline, and while the branch instruction makes its way through the pipeline, instructions in the pipeline may or may not be executed following the branch. When a branch is not taken, the next instruction executed after the branch is called the "fall-through" instruction and the address of this instruction is called the fall-through address. When a branch is taken, the next instruction executed after the branch is called the "branch target" (target) instruction and the address of this instruction is called the target address. Branches are problematic because, when the unresolved branch instruction enters the first stage of the pipeline, the prefetch unit does not have enough information to know whether the next address will be the fall-through address or the target address. Thus, the prefetch unit cannot fetch the next instruction, because it does not know which instruction will be executed next. In many cases, the prefetch unit will fetch the fall-through address (assume branch is not taken), and if the branch is taken, the processor will simply flush the pipeline and accept the time penalty. Since branch instructions typically account for approximately 20% of all instructions executed, this penalty can be severe.
There are several prior art techniques that attempt to address the pipeline stall problem. A first method, as described in U.S. Pat. No. 4,200,927, appears to use a plurality of instruction prefetch buffers and speculatively decodes instructions from both the fall-through address and the target address. The speculatively decoded instructions are then sent to an instruction queue that feeds the execution unit. When the execution unit resolves the direction of the branch path, the instructions from the path not taken are flushed from the queue. This approach cannot be applied to modern pipelines that execute one instruction per clock cycle because this approach relies on the fact that the execution unit is a microprogrammed state machine and requires multiple clock cycles to execute instructions. The lag time provided by multi-cycle operation allows the prefetch unit and the instruction decoder ample time to concurrently process more than one instruction stream. Modern processors include multiple pipelined execution units that operate at substantially the same speed as the prefetch unit and decoder. Hence, this technique is not applicable to modern systems.
Another prior art technique is speculative execution. Speculative execution uses a branch cache, also called a branch target buffer, and two execution units. The branch target buffer holds the branch target address to be forwarded to the prefetch unit and also holds a sequence of target instructions. When a branch is encountered, the branch target address is obtained from the branch target buffer and a second instruction stream is fetched from the branch target address. A separate pipeline is provided to allow both the fall-through instruction stream and the target instruction stream to be processed concurrently. This technique has the advantage that the control stall is completely removed, regardless of whether the fall-through or target path is eventually selected. While this technique avoids the delay due to a stall, it requires considerable additional hardware, including a branch cache, control hardware, a second pipeline, and a second execution unit. This additional hardware may be prohibitively expensive, especially for superscalar and VLIW processors. Superscalar and VLIW processors employ M pipelines and M multiple execution units, so that speculative execution requires a total of 2M pipelines and 2M execution units. In DSPs, some of these execution units are hardware multipliers that require a significant amount of chip area. Further, the speculative execution approach does not take advantage of any inefficiencies in instruction dispatch that may arise in multi-issue program execution due to data dependencies. Hence, the application of this technique is not practical since it would require a very large chip. Even when technology progresses to allow twice as much hardware to be integrated onto a single chip, that extra area would be put to better use by increasing the amount of on-board memory or by adding more execution pipelines.
Still another approach to dealing with control hazards is to use a branch prediction strategy. In branch prediction, a branch cache is used to monitor the most recently taken branches and to keep track of which way the branch has most often gone in the past. Based on past history, the most likely branch path is predicted and fetching begins from the predicted path. The branch cache will generally contain branch history information as well as the precomputed target address, and, in some cases, will contain one or more target instructions. This approach is more applicable to standard microprocessors and controllers, and is less applicable to VLIW processors. VLIW processors fetch very long instruction words (VLIWs) (also called fetch packets) which may contain many sub-instructions located in different fields of the VLIW. A group of sub-instruction fields issued to a set of pipelines simultaneously is known as an "execute packet." In some systems, the VLIW processor can take up to four pipeline stages just to bring the instruction into the prefetch buffer. If branch prediction is used in such a system, a correctly predicted branch will still cause a minimum of four cycles to be wasted. Further, if the prediction is incorrect and the stages are not buffered, then a branch stall occurs. Often the stall due to a mis-prediction is longer than a normal stall because a mis-prediction may invalidate various lines in the instruction cache and the data cache and thereby cause increased overhead due to cache misses. If the branches in the program are not predictable, then branch prediction may actually hamper performance due to cache miss overhead.
Branch prediction has other problems that limit its use in VLIW processors. VLIW processors execute looped code that is optimized using loop unrolling techniques whereby several loop iterations are unrolled into one macro-loop iteration. The branches in the looped code are highly predictable because the branch target instructions will be executed in all but the final iteration of the loop. This end condition is effectively dealt with by using a conditionally executed branch instruction. VLIW processors typically employ "delayed branch" instructions whereby instructions that fill the pipeline immediately after the branch are allowed to conditionally execute. The delay slots behind the delayed branch can be effectively put to use in predictable inner-loop processing by filling the delay slots with target instructions. This same delayed branch technique can be used to improve performance of unconditional branches, such as subroutine calls and returns, simply by inserting the branch instruction several cycles ahead of where it will actually be executed. However, delayed branch techniques do not work well on a VLIW when dealing with data-dependent conditional branches. Some data-dependent conditional branches can be avoided by using conditionally executed instructions, but this technique wastes hardware resources and thus reduces throughput.
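The severity of the flush penalty can be estimated with a simple model. The sketch below is an illustration only: the 20% branch frequency comes from the text above, while the fraction of branches taken and the flush penalty in cycles are assumed values chosen for the example.

```python
def effective_cpi(branch_fraction: float,
                  taken_fraction: float,
                  flush_penalty_cycles: int,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when every taken branch flushes the pipeline.

    Assumes the prefetch unit always fetches the fall-through path, so only
    taken branches pay the penalty.
    """
    stall_cycles_per_instruction = branch_fraction * taken_fraction * flush_penalty_cycles
    return base_cpi + stall_cycles_per_instruction


if __name__ == "__main__":
    # 20% branches (from the text); 50% taken and a 4-cycle flush are assumptions.
    cpi = effective_cpi(branch_fraction=0.20, taken_fraction=0.5, flush_penalty_cycles=4)
    print(f"effective CPI: {cpi:.2f}, throughput loss: {1 - 1 / cpi:.1%}")
```

With these assumed figures, each pipeline loses a substantial share of its peak throughput to control stalls, and the loss grows with pipeline depth because the flush penalty grows with the number of stages ahead of branch resolution.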
The present invention solves these and other problems by providing a pipeline architecture with a branch caching structure that reduces or eliminates pipeline stalls regardless of whether the fall-through or the target instruction is to be executed. The present architecture is hardware efficient and involves simple parallel operations that can be performed in a short clock cycle. The present architecture is useful for reducing branch related delays in a wide variety of processor architectures, including superscalar and VLIW processors with multiple pipelines and processors with long or short instruction fetch related pipeline stages. A further aspect of the present invention is a pipeline architecture and branch caching technique capable of handling the unpredictable branches that cannot be handled using loop unrolling and delayed branching in VLIW systems.
A further aspect of the present invention is a modified pipeline that allows branch instructions to be cached so that when a branch occurs, the pipeline stages that would otherwise have stalled can be filled from the branch cache, thereby avoiding the stall. Yet another aspect of the present invention is to provide hardware to allow branch instructions to be detected early in the instruction pipeline, thereby providing time for the branch cache to operate in processors with very high clock rates. Another aspect of the present invention is an integrated pipeline, branch cache, and control structure that allows the processor to service branch cache misses without adding extra delay cycles. Another aspect of the present invention is an integrated pipeline, branch cache, and control structure that allows the processor to store data needed to service cache hits without incurring any delay cycles after the branch. Still another aspect of the present invention is an integrated pipeline, branch cache, and control structure that allows the processor to respond to cache hits while reducing the amount of branch cache space used to service cache hits without incurring delay cycles after the branch. Another aspect of the present invention is a multi-level branch cache structure which allows a reduced number of prefetch buffers to be stored for a given number of cache tag entries. Still another aspect of the invention is a control strategy that allows a pipeline to fill from the program cache when a target instruction would normally stall the pipeline.
Another aspect of the present invention is a method in a pipelined processor for reducing pipeline stalls caused by branching. The method comprises the steps of prefetching instructions into a first stage of the pipeline and propagating instructions into one or more subsequent stages of the pipeline. A conditional outcome is computed in one of the subsequent stages. Concurrently with processing at a specified stage in the pipeline, one or more instruction op-codes are analyzed to determine whether a cacheable branch instruction is present, and, if the branch instruction is present, a tag relating to the branch instruction is sent to a branch cache. The method includes the further steps of determining, in response to the conditional outcome, whether a branch is to be taken, and, if the branch is to be taken, sending a branch taken signal to the branch cache. If the conditional outcome indicates a branch is not to be taken, the method continues to fetch instructions into the pipeline and to execute the instructions. On receipt of the current branch tag, the branch cache performs the steps of examining a collection of stored branch tags to find a stored branch tag which matches the current branch tag. If the current branch tag is not found in the collection of stored branch tags and the branch is to be taken, the method signals a cache miss and causes the pipeline to fill one or more designated pipeline stages starting at a branch target address. The designated pipeline stages are pipeline stages that stall according to the branch. The branch cache stores the current branch tag and one or more instructions contained within the designated pipeline stages. If the branch taken signal is received and the current branch tag is found in the collection of stored branch tags, the method signals a cache hit and sends a branch target address to the prefetch unit so that instruction fetching can proceed from the branch target address. 
The method provides data stored in the cache to one or more of the designated pipeline stages so that execution can continue without delay irrespective of the conditional outcome.
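The hit and miss handling described in this method can be sketched in software as follows. This is a behavioral illustration under stated assumptions, not the claimed hardware: the class and function names are hypothetical, the cache is modeled as an unbounded dictionary, and the stall on a miss is assumed to equal the number of designated pipeline stages that must be refilled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class BranchCache:
    # tag -> (branch target address, contents of the designated pipeline stages)
    lines: Dict[int, Tuple[int, List[str]]] = field(default_factory=dict)

    def lookup(self, tag: int) -> Optional[Tuple[int, List[str]]]:
        """Compare the current branch tag against the stored branch tags."""
        return self.lines.get(tag)

    def fill(self, tag: int, target_address: int, stage_contents: List[str]) -> None:
        """Store the tag and the instructions held in the designated stages."""
        self.lines[tag] = (target_address, list(stage_contents))


def service_branch(cache: BranchCache, tag: int, taken: bool, target_address: int,
                   fetch_stages: Callable[[int], List[str]]) -> Tuple[str, int, List[str]]:
    """Return (outcome, stall_cycles, instructions supplied to the designated stages)."""
    if not taken:
        # Fall-through path: the pipeline simply keeps fetching and executing.
        return "not_taken", 0, []
    entry = cache.lookup(tag)
    if entry is not None:
        # Cache hit: the designated stages are loaded from the cache, no stall.
        _, cached = entry
        return "hit", 0, list(cached)
    # Cache miss: refill the designated stages from the target address (assumed
    # to cost one cycle per stage) and remember them for the next encounter.
    stages = fetch_stages(target_address)
    cache.fill(tag, target_address, stages)
    return "miss", len(stages), list(stages)


if __name__ == "__main__":
    cache = BranchCache()
    fetch = lambda addr: [f"insn@{addr + 4 * i:#x}" for i in range(3)]
    print(service_branch(cache, tag=0x40, taken=True, target_address=0x100, fetch_stages=fetch))
    print(service_branch(cache, tag=0x40, taken=True, target_address=0x100, fetch_stages=fetch))
```

In this model the first taken encounter of a branch pays the stall and fills the cache, and later taken encounters of the same branch proceed without delay.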
Another aspect of the present invention is a computer processor which comprises an instruction pipeline comprising a plurality of stages. Each stage contains pipeline data. A branch cache comprises a plurality of cache lines. Each cache line comprises a stored branch tag and stored cache data. A branch cache controller is configured to detect a cacheable branch instruction in one of the pipeline stages. The branch cache controller receives a current branch tag from one of the pipeline stages. The branch cache controller receives conditional information indicative of whether the branch shall be taken. The branch cache controller attempts to match the current branch tag to a stored branch tag for a first cache line. If the branch is to be taken, the branch cache controller signals a cache miss when the attempt to match fails and signals a cache hit when the attempt to match succeeds. In response to the cache miss, the branch cache controller stores the current branch tag in the branch tag location of a designated cache line. The branch cache controller further stores data from one or more of the pipeline stages which stall in response to the cacheable branch instruction. The data from the stalled pipeline stages are stored in the cache data location of the designated cache line. In response to the cache hit, the branch cache controller loads one or more of the pipeline stages from the stored cache data to avoid a pipeline stall from the cacheable branch instruction.
Another aspect of the present invention is a computer processor which comprises an instruction pipeline which comprises a plurality of stages. Each stage contains data. The processor includes means for storing data from one or more of the pipeline stages and for restoring data to one or more of the pipeline stages. The processor further includes means for controlling the means for storing. The means for controlling causes the branch cache to store data from one or more of the pipeline stages in response to execution of a cacheable branch instruction which triggers a cache miss. The means for controlling also causes the means for storing to restore data to one or more of the pipeline stages in response to a cache hit, thereby avoiding pipeline stalls when a cache hit occurs.
Another aspect of the present invention is a method in a pipelined microsystem such as a microprocessor, DSP, media processor, or microcontroller. The method is a method to load branch instruction information into a branch cache so as to allow the branch instruction to execute subsequently with a reduced or eliminated time penalty by minimizing the amount of information to be cached. The method comprises the step of: monitoring the instruction stream in a dispatch unit in a pipeline stage to detect whether a branch instruction of a selected type is present. When the branch instruction is detected, the method signals to a branch cache control unit that the instruction is present. The method makes available at least a portion of an address of the branch instruction to the branch cache control unit. The method compares the portion of the address of the branch instruction to a set of cache tags containing branch instruction address related information. When the branch instruction does not match any tag, the method fills the branch cache entry so that, when the branch instruction is next encountered, the tag will match and the branch target stream can proceed without delay. When program execution makes a branch target fetch packet available to be cached to allow the target instruction stream to execute to a target prefetch buffer, the method loads data from the target prefetch buffer into a position in the branch cache line associated with the branch instruction and sets a counter to a prespecified number, d, corresponding to the maximum possible number of fetch packets that may need to be cached. The method decrements the counter on each subsequent cycle. The method loads subsequent fetch packets from the target instruction stream into the branch cache line only when they are fetched. 
The method exits the branch cache fill operation when the counter has decremented to a specified number such that the branch cache line is filled with the appropriate number of target prefetch packets that are fetched in the first d time slots when the target instruction stream is executed. Preferably, the method includes the further step of loading stall override bits into the branch cache line. The stall override bits indicate for each of the d cycles whether or not the branch cache will supply the target fetch packet during a given cycle. Also preferably, the method includes the further step of storing a condition field to indicate a register or an execute stage which supplies the conditional branch information so that the branch cache can resolve the branch early. Also preferably, the method includes the step of supplying an auxiliary link field which points to a next prefetch buffer of the cache line. The auxiliary link field creates a linked list in a variable-length cache line structure. Preferably, the method further includes the step of caching shadow dispatch unit pre-evaluation data to allow a shadow dispatch unit to dispatch instructions using less hardware than the dispatch unit.
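The counter-controlled fill operation can be illustrated with a short sketch. It assumes that the fill exits when the counter reaches zero and that the prefetch activity of the target stream is available as a per-cycle trace; both the trace representation and the function name are hypothetical.

```python
from typing import List, Optional, Tuple


def fill_branch_cache_line(fetch_trace: List[Optional[str]],
                           d: int) -> Tuple[List[str], List[bool]]:
    """Fill one branch cache line over the first d cycles of the target stream.

    fetch_trace[i] is the fetch packet fetched in cycle i, or None when no packet
    was fetched in that cycle (for example, because the previous fetch packet was
    still being dispatched as several execute packets).
    Returns the cached fetch packets and the per-cycle stall override bits.
    """
    cached_packets: List[str] = []
    stall_override_bits: List[bool] = []
    counter = d  # maximum possible number of fetch packets that may need caching
    for fetched in fetch_trace:
        if counter == 0:  # assumed exit threshold
            break
        if fetched is not None:
            # Packets are loaded only in the cycles in which they are fetched.
            cached_packets.append(fetched)
        # One bit per cycle: will the branch cache supply a fetch packet here?
        stall_override_bits.append(fetched is not None)
        counter -= 1
    return cached_packets, stall_override_bits


if __name__ == "__main__":
    trace = ["FP0", None, "FP1", "FP2", "FP3"]
    print(fill_branch_cache_line(trace, d=4))
```

In the example trace, only three fetch packets are fetched during the first four cycles, so the line holds three packets rather than four, which is the space saving the method aims for.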
Another aspect of the present invention is a method for a pipelined microsystem such as a microprocessor, DSP, media processor, or microcontroller. The method services branch cache hits so as to reduce or eliminate cycle loss due to branching. The method comprises the step of monitoring the instruction stream in a pipeline stage to detect whether a branch instruction of a selected type is present. When the branch instruction is detected, the method signals to a branch cache control unit that the instruction is present. At least a portion of an address of the branch instruction is made available to the branch cache control unit. The method further includes the step of comparing the portion of the address of the branch instruction to a set of tags containing branch instruction address related information. When the branch instruction does match a tag and the branch is evaluated to be taken, the method performs the steps of reading a target prefetch buffer out of the branch cache and supplying the target prefetch buffer to a shadow dispatch unit. The prefetch buffer is dispatched from the shadow dispatch unit to a multiple execution pipeline in units of execute packets. Instructions are prefetched at a full prefetch rate irrespective of whether multiple cycles are required to dispatch a fetch packet. The prefetching of instructions continues at a full prefetch rate until early pipeline stages catch up to later pipeline stages. As a result, the target instruction stream proceeds at full speed and only a minimum number of fetch packets needed to support full speed execution are fetched from the branch cache.
Another aspect of the present invention is a method for a pipelined microsystem such as a microprocessor, DSP, media processor, or microcontroller. The method services branch cache hits so as to reduce or eliminate cycle loss due to branching. The method comprises the step of monitoring the instruction stream in a pipeline stage to detect whether a branch instruction of a selected type is present. When the branch instruction of a selected type is detected, the method signals to a branch cache control unit that the instruction is present, and makes at least a portion of the branch instruction's address available to the branch cache control unit. The method includes the further step of comparing the portion of an address of the branch instruction to a set of tags containing branch instruction address related information. When the branch instruction does match a tag and the branch is evaluated to be taken, the method performs the step of reading the target prefetch buffer out of the branch cache. The contents of the target prefetch buffer are supplied to a multiplexer which routes the contents of the target prefetch buffer back to the dispatch unit. The contents of the target prefetch buffer are dispatched to the pipeline in units of execute packets. Instructions are prefetched by the pipeline at full speed, irrespective of whether it takes multiple cycles to dispatch a fetch packet, until the early pipeline stages catch up to the later pipeline stages. As a result, the target instruction stream proceeds at nearly full speed, and only a minimum number of fetch packets needed to support full speed execution are fetched from the branch cache.
Another aspect of the present invention is a method for a VLIW processor which fetches groups of instructions in fetch packets and dispatches subsets thereof as execute packets in one or more clock cycles. The method reduces the size of a branch cache which buffers branch target information. The method comprises the steps of caching the target prefetch buffer when a branch cache miss is detected; and caching a variable number of immediately following prefetch buffers. The number of cached prefetched buffers is the number of prefetch buffers that are fetched in the target instruction stream during the first d cycles of execution, where the number d is related to the number of pipeline stages that would otherwise stall when a branch occurs.
Another aspect of the present invention is a branch cache to be used in a multi-issue processor having an address generate portion in a prefetch unit. The processor dispatches in each clock cycle variable numbers of instructions contained in each fetch packet. The cache comprises a plurality of lines. Each line comprises a tag field which holds information relating to the addresses of branch instructions. The information includes address information of branch instructions of a selected type or types. Each cache line also comprises a branch address field which holds an address near to the branch target address, so that this near address can be forwarded to the program address generate portion of the prefetch unit for target instruction stream fetching. A prefetch buffer field in each cache line holds the first prefetch buffer of the target instruction stream. At least one link field in each cache line indicates whether more prefetch buffers are associated with the tag field. At least one extra prefetch buffer field is provided in each cache line. Preferably, the number of extra prefetch buffer fields is determined by initial prefetch activity of the target instruction stream. Also preferably, each cache line additionally comprises a pipeline stall override field which signals the prefetch unit to continue to fetch instructions when there would otherwise be a pipeline stall due to multiple execute packets being dispatched from a single target fetch packet. Also preferably, additional prefetch buffers of the cache line are arranged in a linked list structure.
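The cache line layout described above can be pictured as a small data structure. The following is a sketch under stated assumptions: field widths are ignored, fetch packets are represented as strings, and the extra prefetch buffers are assumed to live in a shared pool indexed by the link fields. All type and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ExtraPrefetchBuffer:
    fetch_packet: str
    link: Optional[int] = None  # pool index of the next extra buffer, or None


@dataclass
class BranchCacheLine:
    tag: int                    # address information of the cached branch instruction
    branch_address: int         # address near the branch target, sent to address generate
    prefetch_buffer: str        # first prefetch buffer of the target instruction stream
    link: Optional[int] = None  # pool index of the first extra buffer, or None
    stall_override: Tuple[bool, ...] = ()  # per-cycle "keep fetching" indications


def collect_fetch_packets(line: BranchCacheLine,
                          pool: List[ExtraPrefetchBuffer]) -> List[str]:
    """Walk the linked list and return every fetch packet associated with the tag."""
    packets = [line.prefetch_buffer]
    index = line.link
    while index is not None:
        packets.append(pool[index].fetch_packet)
        index = pool[index].link
    return packets


if __name__ == "__main__":
    pool = [ExtraPrefetchBuffer("FP2"), ExtraPrefetchBuffer("FP1", link=0)]
    line = BranchCacheLine(tag=0x40, branch_address=0x100, prefetch_buffer="FP0", link=1)
    print(collect_fetch_packets(line, pool))
```

Keeping the extra buffers in a pool means that a line whose target stream needs only one prefetch buffer does not reserve space for more, which is one way to read the variable-length, linked-list arrangement.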
Another aspect of the present invention is a method to fill an instruction pipeline after a branch instruction is detected which selects a target instruction stream. The method comprises the steps of reading a prefetch buffer out of the branch cache line associated with the instruction which caused the branch cache hit; sending the cached prefetch buffer to a shadow dispatch unit; routing the output of the shadow dispatch unit to a multiplexer which selects instruction information from a dispatch unit in the execution pipeline or from a shadow dispatch unit; providing a select signal which forces the multiplexer to select the cached fetch packet from the shadow dispatch unit; forwarding the fetch packet to decoder stages of an execution pipeline in units of execute packets; allowing the prefetch stages of the instruction pipeline to continue functioning irrespective of how many execute packets are in each fetch packet until the instruction pipeline is filled; and supplying the requisite number of fetch packets from the branch cache to allow the target instruction stream to proceed without adding extra delay cycles.
Another aspect of the present invention is a method to fill an instruction pipeline after a branch instruction is detected which selects a target instruction stream. The method comprises the steps of reading a prefetch buffer out of the branch cache line associated with the instruction which caused the branch cache hit; sending the cached prefetch buffer to a dispatch unit; routing the output of the dispatch unit to decoder stages of an execution pipeline in units of execute packets; allowing the prefetch stages of the instruction pipeline to continue functioning irrespective of how many execute packets are in each fetch packet until the instruction pipeline is filled; and supplying the requisite number of fetch packets from the branch cache to allow the target instruction stream to proceed without adding extra delay cycles.
Another aspect of the present invention is a method to detect and control the branch cache related processing of branch instructions in processing systems comprising a first cacheable branch instruction type and a second non-cacheable branch instruction type. The method comprises the step of evaluating bits located in an instruction that passes through a selected stage of an instruction pipeline to determine whether the instruction corresponds to a cacheable branch instruction. If the instruction corresponds to a cacheable branch instruction, the method performs the step of evaluating a condition and a tag associated with the instruction to determine whether data needs to be read out of a branch target buffer. If the instruction is not a branch instruction or is a non-cacheable branch instruction, the method continues processing of the instruction and aborts any subsequent branch cache processing for the instruction.
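The detection step can be sketched as a bit test on the instruction word. The encoding below is entirely hypothetical (a 32-bit word whose top four bits select the instruction class) and is used only to show the control decision; the specification does not define these bit positions or values.

```python
from typing import Set

# Hypothetical encoding, assumed for illustration only.
OPCODE_SHIFT = 28
OPCODE_MASK = 0xF
CACHEABLE_BRANCH = 0b1010
NON_CACHEABLE_BRANCH = 0b1011


def classify(instruction_word: int) -> str:
    """Evaluate the opcode bits of an instruction passing through the selected stage."""
    opcode = (instruction_word >> OPCODE_SHIFT) & OPCODE_MASK
    if opcode == CACHEABLE_BRANCH:
        return "cacheable_branch"
    if opcode == NON_CACHEABLE_BRANCH:
        return "non_cacheable_branch"
    return "other"


def needs_target_buffer_read(instruction_word: int, condition_true: bool,
                             tag: int, stored_tags: Set[int]) -> bool:
    """Decide whether data must be read out of the branch target buffer."""
    if classify(instruction_word) != "cacheable_branch":
        # Not a branch, or a non-cacheable branch: normal processing continues
        # and no further branch cache processing is performed.
        return False
    # Cacheable branch: a read is needed only when the branch is taken and
    # its tag matches a stored tag.
    return condition_true and tag in stored_tags


if __name__ == "__main__":
    tags = {0x40}
    print(needs_target_buffer_read(0xA0000000, True, 0x40, tags))
    print(needs_target_buffer_read(0xB0000000, True, 0x40, tags))
```

Filtering on instruction type before any tag comparison keeps non-cacheable branches and ordinary instructions from consuming branch cache bandwidth.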
Another aspect of the invention is a pipelined processor which includes a branch acceleration technique which is based on an improved branch cache. The improved branch cache minimizes or eliminates delays caused by branch instructions, especially data-dependent unpredictable branches. In pipelined and multiply pipelined machines, branches can potentially cause the pipeline to stall because the branch alters the instruction flow, leaving the prefetch buffer and first pipeline stages with discarded instructions. This has the effect of reducing system performance by making the branch instruction appear to require multiple cycles to execute. The improved branch cache differs from conventional branch caches. In particular, the improved cache is not used for branch prediction, but rather, the improved branch cache avoids stalls by providing data that will be inserted into the pipeline stages that would otherwise have stalled when a branch is taken. Special architectural features and control structures are supplied to minimize the amount of information that must be cached by recognizing that only selected types of branches should be cached and by making use of available cycles that would otherwise be wasted. The improved branch cache supplies the missing information to the pipeline in the place of the discarded instructions, completely eliminating the pipeline stall. This technique accelerates performance, especially in real-time code that must evaluate data-dependent conditions and branch accordingly.