FIG. 14 is a simplified diagram showing a conventional microprocessor (processor) 1400 that utilizes an instruction buffer (i.e., decode instruction buffer (DIB) 122, discussed below) for storing fetched program instructions before issuance to an execution pipeline. Processor 1400 is generally consistent with the TriCore™ family of processor devices produced by Infineon Technologies AG of Munich, Germany. Those skilled in the art of processors will recognize that the description of processor 1400 is greatly simplified for explanatory purposes, and that some of the circuit components described separately below may be integrated with other components, or omitted entirely.
Processor 1400 is generally partitioned into a pre-fetch stage 110, a fetch/pre-decode stage 115, a decode stage 120, and an execution stage 130. Pre-fetch stage 110 includes program counter 111 and a memory management unit (MMU) 112 that cooperate to transmit address signals used to read corresponding program instructions from a system (e.g., cache, local, and/or external) memory 101, which then writes these program instructions to fetch/pre-decode stage 115. Fetch/pre-decode stage 115 includes a fetch portion 116 having program memory interface (PROG MEM INTRFC) 117 for receiving the program instructions, and a pre-decode portion 118 including a decode instruction buffer input circuit 119 that partially decodes the instructions, and writes the instructions into decode stage 120 in the manner described below. Decode stage 120 includes DIB 122 and a decode/issue circuit 125. Execution stage 130 includes the processor “pipeline” that executes the decoded program instructions issued from decode stage 120. In the present example, execution stage 130 includes two processor pipelines: a load/store (LS) pipeline 132, and an integer processing (IP) pipeline 136. Each pipeline includes two execution stages (i.e., EX1 and EX2) and a write back stage. Processor 1400 also includes loop counter register 105A, which in the present example stores a loop counter value. Note that loop counter register 105A may be one of several general-purpose registers provided by processor 1400.
DIB 122 can be logically represented as a circular buffer having several registers (e.g., four registers REG1-REG4), an input (write) pointer controlled by DIB input circuit 119, and one or more output pointers controlled by decode/issue circuit 125. The write pointer points to one of registers REG1-REG4, and fetch/pre-decode stage 115 writes one, two, three or four instructions to the pointed-to register each write cycle. For example, in a first write cycle the write pointer points to REG1 and four 16-bit instructions are written to REG1, then in a next write cycle the write pointer points to REG2 and two 32-bit instructions are written to REG2 . . . then the write pointer points to REG4 and one 32-bit instruction and two 16-bit instructions are written to REG4, then the write pointer returns to REG1 and new instructions are written into REG1. Note that previously written instructions are issued from each register before new instructions are written to that register. Also, depending on the processor, one or more of these instructions are issued from registers REG1-REG4 to execution stage 130 during each issue cycle, where the decoded instructions are issued either to LS pipeline 132 or to IP pipeline 136, depending on the issued instruction's “type”. For example, in a first issue cycle, a first 16-bit or 32-bit IP-type instruction is issued to IP pipeline 136 and a second 16-bit or 32-bit LS-type instruction is issued to LS pipeline 132 from DIB register REG1. Depending on the processor, the order in which the LS-type instructions and IP-type instructions are arranged may determine whether one or two instructions are issued per issue cycle. For example, in a second issue cycle, a third 16-bit or 32-bit LS-type instruction (which follows the previously-issued second LS-type instruction) may be issued to LS pipeline 132 from REG1 (i.e., because the second and third instructions are LS instructions, no IP instruction is issued during the second issue cycle).
This issue process continues, first issuing from REG1, then moving to REG2, REG3, and REG4, respectively, and then returning to REG1. By storing and issuing several instructions in registers REG1–REG4 in this manner, DIB 122 acts as an instruction buffer that allows fetch/pre-decode stage 115 to operate at a different speed than execution stage 130, which facilitates high speed processing.
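The circular-buffer behavior described above can be illustrated with a minimal sketch. It is not a description of the actual DIB circuitry: the class name `DecodeInstructionBuffer`, the methods `write` and `issue`, the representation of an instruction as a `(name, type)` tuple, and the pairing rule (an IP-type instruction immediately followed by an LS-type instruction may issue together, otherwise one instruction issues) are all assumptions made for illustration.

```python
from collections import deque


class DecodeInstructionBuffer:
    """Toy model of a circular instruction buffer (hypothetical, for illustration)."""

    def __init__(self, num_registers=4):
        self.registers = [deque() for _ in range(num_registers)]
        self.write_ptr = 0  # register that receives the next write
        self.issue_ptr = 0  # register holding the oldest unissued instruction

    def write(self, instructions):
        """Write one to four instructions into the pointed-to register.

        Returns False (a stall) when that register still holds unissued
        instructions, because old contents must be issued before new ones
        are written.
        """
        if not 1 <= len(instructions) <= 4:
            raise ValueError("a write cycle carries one to four instructions")
        target = self.registers[self.write_ptr]
        if target:
            return False
        target.extend(instructions)
        self.write_ptr = (self.write_ptr + 1) % len(self.registers)
        return True

    def _peek(self):
        reg = self.registers[self.issue_ptr]
        return reg[0] if reg else None

    def _pop(self):
        reg = self.registers[self.issue_ptr]
        inst = reg.popleft()
        if not reg:
            # Register drained: move on to the next register in the ring.
            self.issue_ptr = (self.issue_ptr + 1) % len(self.registers)
        return inst

    def issue(self):
        """Return the instructions issued in one issue cycle (zero, one or two)."""
        if self._peek() is None:
            return []
        group = [self._pop()]
        nxt = self._peek()
        # Assumed pairing rule: IP-type followed by LS-type issues as a pair.
        if group[0][1] == "IP" and nxt is not None and nxt[1] == "LS":
            group.append(self._pop())
        return group


dib = DecodeInstructionBuffer()
dib.write([("I0", "IP"), ("I1", "LS"), ("I2", "LS"), ("I3", "IP")])
first_cycle = dib.issue()   # IP-type and LS-type instruction issue together
second_cycle = dib.issue()  # the next LS-type instruction issues alone
```

Under these assumptions the first issue cycle pairs an IP-type instruction with an LS-type instruction, and the second issue cycle issues a single LS-type instruction, which mirrors the example given above.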
Operation of processor 1400 typically involves processing (executing) a software program, which is a predetermined series of program instructions read from system memory 101 that collectively cause processor 1400 to perform a desired computing task. During development of such software programs, the program instructions are generally arranged in the order in which they are processed (executed), and the thus-arranged program instructions are assigned (stored) in corresponding sequential memory locations in system memory 101 prior to execution by processor 1400.
Program instructions can be generally classified as operations, which are sequentially executed in execution stage 130, and branch (or jump) instructions that cause program control to “jump” from one instruction to an out-of-order instruction. One conditional branch instruction that is often used in software programs is a loop instruction, which allows a program to repeatedly execute an instruction (or series of instructions) a specified number of times, or until a certain condition is met. Almost all programming languages have several different loop instructions designed for different purposes.
FIG. 15 is a simplified diagram depicting a portion 1500 of a software program that utilizes a commonly used type of loop instruction. Each instruction INST0 through INST12 of program portion 1500 is assigned a sequentially arranged address X0000 through X1100, respectively, that represents a corresponding memory location in memory 101 (FIG. 14). For sake of brevity, the operations performed by instructions INST0 through INST12 are only indicated for instructions that are relevant to the following discussion. For example, instruction INST1 sets a loop counter R1 to integer value three (indicated by “[R1==3]”), and instruction INST9 is a loop instruction that functions as described below. The other instructions (i.e., INST0, INST2–INST8, and INST10–INST12) perform operations that are sequential in nature (i.e., these instructions do not produce a non-sequential change in program control).
In the present example, loop instruction INST9 is of a type that functions to decrement a designated loop counter (i.e., loop counter R1 in this example) by one each time loop instruction INST9 is executed, to pass program control to a target instruction (i.e., address X0010, which makes instruction INST2 the target instruction of loop instruction INST9 in this example) while loop counter R1 is greater than zero, and to pass program control to the next sequential (fall-through) instruction following the loop instruction (i.e., instruction INST10 in this example) when loop counter R1 equals zero. As utilized herein, the term “taken” refers to the case where, when the loop instruction is executed, program control jumps to the target instruction, and the term “not-taken” refers to the case where program control passes to the loop's fall-through instruction. Accordingly, while loop counter R1 remains greater than zero, loop instruction INST9 is “taken”, and program control jumps to target instruction INST2. The “loop body” (i.e., instructions INST2–INST8) is thereby repeatedly executed until loop counter R1 is decremented to zero, when the loop is “not-taken”, and program control passes to fall-through instruction INST10.
Referring back to the top of FIG. 14, during execution of the software program, program counter 111 typically generates sequential program counter values NEXT_PC that are converted by MMU 112 to memory addresses used to sequentially access the memory locations in memory 101, thereby reading and processing the program instructions in the prearranged order. When branch or jump instructions (e.g., loop instructions) are executed, a non-sequential value (INJECTED_PC) is transmitted to program counter 111, and a corresponding non-sequential address is transmitted to memory 101. The thus-reset program counter/MMU then proceeds to generate sequential addresses subsequent to the injected address until another interruption occurs.
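The interaction between sequential program counter values and an injected value can be sketched as follows. The function name `program_counter_values` and the `injected` mapping (fetch step to injected value) are hypothetical, and addresses are modeled as plain integers rather than MMU-translated memory addresses.

```python
def program_counter_values(start, steps, injected=None):
    """Return the program counter value produced at each fetch step.

    `injected` maps a step index to an INJECTED_PC value (hypothetical model).
    """
    injected = injected or {}
    pc = start
    values = []
    for step in range(steps):
        if step in injected:
            pc = injected[step]  # non-sequential value overrides NEXT_PC
        values.append(pc)
        pc += 1                  # sequential NEXT_PC for the following step
    return values


# Sequential fetch from 0, with a jump back to 2 injected at step 5.
sequence = program_counter_values(0, 8, {5: 2})
```

In this sketch the counter resumes sequential generation from the injected value, so `sequence` is `[0, 1, 2, 3, 4, 2, 3, 4]`.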
Referring again to FIG. 15, during “loop entry” (i.e., the first pass through the instructions preceding loop instruction INST9), pre-loop instructions INST0 and INST1 are executed (setting loop counter R1 to three), then the loop body is executed for the first time, then loop instruction INST9 is executed for the first time (indicated by the left-most arrow A in FIG. 15). As indicated, loop instruction INST9 decrements loop counter R1 to two (R1=2), determines that the value stored in loop counter R1 does not equal zero, and therefore causes a “loop taken” operation in which program control passes back to instruction INST2 (address X0010). “Inner loop” processing of the loop body is then performed during which loop counter R1 is decremented to one (R1=1) during a second iteration, and to zero (R1=0) during a third iteration, with loop instruction INST9 causing another “loop taken” operation each time. “Loop exit” occurs when loop instruction INST9 is encountered for the fourth time and loop counter R1 equals zero, which results in a “loop not-taken” operation that passes program control to fall-through instruction INST10. Program execution then proceeds to sequentially execute instructions (e.g., instruction INST11 and then INST12) until another branch or jump is encountered.
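The walkthrough above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the processor's actual loop logic: instructions are modeled as indices 0 through 12, the function name `run_loop_program` is hypothetical, and the counter is assumed to be tested before it is decremented, since that ordering yields the sequence described above (R1 = 2, 1, 0 on three taken passes, followed by a not-taken result on the fourth encounter).

```python
def run_loop_program(initial_count=3, last=12, loop_at=9, target=2):
    """Simulate program portion 1500 (hypothetical model).

    Returns (trace, outcomes): the executed instruction indices, and one
    (result, R1 value) pair for each execution of the loop instruction.
    """
    r1 = 0
    pc = 0
    trace = []
    outcomes = []
    while pc <= last:
        trace.append(pc)
        if pc == 1:
            r1 = initial_count              # INST1 sets loop counter R1
            pc += 1
        elif pc == loop_at:
            if r1 != 0:                     # assumed: test, then decrement
                r1 -= 1
                outcomes.append(("taken", r1))
                pc = target                 # jump back to INST2
            else:
                outcomes.append(("not-taken", r1))
                pc += 1                     # fall through to INST10
        else:
            pc += 1                         # sequential instruction
    return trace, outcomes


trace, outcomes = run_loop_program()
```

With the counter initialized to three, `outcomes` is `[("taken", 2), ("taken", 1), ("taken", 0), ("not-taken", 0)]`, and the loop body is executed four times before control passes to INST10.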
A problem with processors that utilize instruction buffers (i.e., processors similar to processor 1400; discussed above) is that the conditional branch operation of a loop instruction (i.e., whether the loop instruction is taken or not-taken) is decided when the loop instruction is executed (e.g., when the loop instruction is issued to LS pipeline 132; see FIG. 14). As mentioned above, when loop instruction INST9 is taken, program control passes (jumps) to target instruction INST2. The problem is that, after fetching loop instruction INST9, program counter 111 and MMU 112 continue to fetch sequentially addressed instructions from memory 101 until the execution stage generates the injected counter value associated with target instruction INST2. That is, at the time the loop instruction is executed, several fall-through instructions (e.g., INST10–INST12) have been fetched and stored in the various stages preceding execution stage 130, and target instruction INST2 has not yet been fetched. Accordingly, processor 1400 must wait after each loop iteration (i.e., each time loop instruction INST9 is executed) while target instruction INST2 and subsequent loop body instructions are fetched, passed through the various processor stages, and issued to execution stage 130. Consequently, each loop iteration produces a “loop taken penalty”, which is typically measured by the number of processor clock cycles between executing the loop instruction and executing that loop's target instruction. The loop taken penalty is particularly large when, as in the case of processor 1400, a processor includes several stages and an instruction buffer (i.e., DIB 122) preceding the execution stage because of the number of processor clock cycles required for the target instruction to pass through these stages.
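The cumulative cost of the loop taken penalty can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `loop_cycle_estimate` is hypothetical, and the cycle counts used in the example (eight cycles per pass, a four-cycle penalty) are assumed values, not figures for processor 1400.

```python
def loop_cycle_estimate(cycles_per_pass, taken_count, penalty_cycles):
    """Estimate useful and wasted cycles for a counted loop.

    cycles_per_pass: cycles to execute the loop body plus the loop
                     instruction once (assumed value).
    taken_count:     number of times the loop instruction is taken.
    penalty_cycles:  loop taken penalty in clock cycles (assumed value).
    """
    passes = taken_count + 1                 # the final pass is not-taken
    useful = passes * cycles_per_pass
    wasted = taken_count * penalty_cycles    # penalty paid on each taken loop
    return useful, wasted, wasted / (useful + wasted)


# Three taken loops, as in the FIG. 15 example, with assumed cycle counts.
useful, wasted, fraction = loop_cycle_estimate(8, 3, 4)
```

With these assumed numbers, 12 of 44 cycles (about 27 percent) are spent waiting for the target instruction, and the wasted share grows as the loop body becomes shorter relative to the penalty.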
What is needed is a processor that is able to minimize the loop taken penalty. Ideally, what is needed is a “zero-overhead” processor that eliminates the loop taken penalty and executes loop instructions without consuming any execution cycles of the processor.