Embodiments of the inventive subject matter generally relate to the field of system and processor architecture, and, more particularly, to reducing instruction issuance latency.
Conventional data processing systems ranging from mobile and embedded devices to super computers typically include one or more processing elements (e.g., central processing units, graphics processing units, co-processors or the like) frequently embodied within one or more integrated circuits for the purpose of processing data resident within one or more elements of a data storage hierarchy. The majority of such processing elements are designed to operate in a pipelined fashion, whereby data processing is broken down into a series of steps or “stages” with associated logic elements separated by storage buffers or registers typically implemented with “flip-flop” or “latch” circuits. Advancement of instructions through the pipeline is typically controlled or synchronized via the application of a clock signal to all components of the processing element.
Pipelining typically yields a number of advantages over similar non-pipelined architectures. As multiple pipeline stages can operate substantially simultaneously, integrated circuit logic is used more efficiently than in non-pipelined architectures, where functional units or logic elements may sit idle. Consequently, overall instruction throughput in terms of the number of instructions performed per unit time is typically increased. Many pipelined processing elements are capable of issuing or completing at least one instruction per clock cycle and such systems are said to be “fully pipelined”.
While pipelining increases instruction throughput, it does not decrease, and typically slightly increases, the execution time of an individual instruction. Conventional pipelined processor designs therefore typically suffer from a number of known drawbacks. Most of the drawbacks associated with pipelined processors are due to the potential for hazards to occur which prevent subsequent instructions from advancing in the pipeline and completing execution during their associated pipeline slots or clock cycles. Hazards fall into three classes: structural, control, and data. Structural hazards arise from resource conflicts when system hardware cannot support all possible combinations of instructions in overlapped execution. Control hazards arise from pipelining of branches and other instructions that change the processor program counter (PC). Data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
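By way of illustration only, the data dependence underlying a data hazard can be sketched as follows. The `Instr` record, the register names, and the `raw_dependencies` helper are hypothetical and are not part of any described embodiment. The sketch only detects that a later instruction reads a value written by an earlier one; whether that dependence is actually exposed as a hazard depends on how the instructions overlap in a given pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, any number of sources."""
    name: str
    dest: str
    srcs: Tuple[str, ...]


def raw_dependencies(program: List[Instr]) -> List[Tuple[str, str]]:
    """Return (producer, consumer) pairs where the consumer reads a register
    most recently written by the producer."""
    deps = []
    last_writer = {}  # register name -> name of the most recent writer
    for instr in program:
        for src in instr.srcs:
            if src in last_writer:
                deps.append((last_writer[src], instr.name))
        last_writer[instr.dest] = instr.name
    return deps


program = [
    Instr("INSTR 1", dest="r3", srcs=("r1", "r2")),
    Instr("INSTR 2", dest="r5", srcs=("r3", "r4")),  # reads r3 written by INSTR 1
]
dependencies = raw_dependencies(program)
```

Here INSTR 2 depends on the result of INSTR 1, which is the situation that result forwarding is intended to handle without stalling.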
One technique used to address data hazards in modern processors without “stalling” instruction processing is the use of result forwarding. In result forwarding, instruction processing (e.g., execution) results are rerouted prior to reaching a final pipeline stage to be used in the processing of a subsequent instruction. FIG. 1 illustrates a processing element including early result forwarding according to the prior art.
In the processing element of FIG. 1, data of each of two separate instruction operands is retrieved or applied from a register file (not shown) to each of two corresponding multiplexers (110A and 110B). In the embodiment of FIG. 1, the illustrated processing element is capable of processing operands and generating results having a data width of 2N, where N is a positive integer value. Operands received and selected utilizing multiplexers 110A and 110B are applied to corresponding unpack_2N blocks 112A and 112B, which convert the received data from a 2N-bit wide external or “interface” format into an internal format utilized in operand processing by an associated execution unit such as execution unit 116.
Once converted, internal format operands are stored in corresponding operand registers 114A and 114B as shown. In the depicted processing element of FIG. 1, operand registers 114A and 114B are implemented as multiplexer (MUX) latches capable of both storing operands to be processed and selecting between inputs from the unpack blocks and forwarded result inputs, as further described herein. A corresponding pack_2N block 118 converts the execution result data from an internal format back into the 2N-bit wide interface format. Interface format results generated by execution unit 116 may then be stored in a result register 120 from which the result, upon selection utilizing a global result multiplexer 122, may be reapplied to the register file as instruction processing completes and/or applied to the operand registers of the same or another processing element as shown.
In the processing element of FIG. 1, early forwarding support is provided via buses 124 or 126 as shown. Results produced by execution unit 116 may be provided from an output of the execution unit to operand registers 114A and 114B in internal format via bus B 124, as will be described in further detail with respect to FIG. 2, or alternatively in a 2N-bit wide interface format from result register 120 to multiplexers 110A and 110B via bus A 126, as will be described in further detail with respect to FIG. 3. It should be noted that bus B 124 as depicted in FIG. 1 is private to execution unit 116, whereas bus A 126 is shared among a plurality of execution units having access to a common register file. Consequently, bus A 126 carries each execution result only in the cycle in which the result is to be sent to the register file, and bus R (the output of result register 120) cannot be guaranteed to be selected via global result multiplexer 122 in other cycles. FIG. 2 illustrates a timing diagram depicting early result forwarding via an internal operand format bus according to the prior art.
In the timing diagram of FIG. 2, a processing element clock cycle at which operands of a given instruction are available at inputs of operand registers 114A and 114B is referred to as cycle “RF”. Based on the overall pipeline depth and the number of logic levels of an associated processing element, a processing result is available on bus R, denoting an output of result register 120, K cycles after RF, where K is a positive integer value representing the number of pipeline stages of the processing element as a whole. The timing diagram of FIG. 2 depicts the processing of two successive instructions, INSTR 1 and INSTR 2, where the INSTR 2 instruction depends on an execution result of INSTR 1.
At an initial clock cycle RF, indicated by the left-most timing interval, operands of INSTR 1 are available at inputs of operand registers 114A and 114B. At an immediately subsequent clock signal cycle (RF+1), INSTR 1's operands enter a first pipeline stage of execution unit 116. INSTR 1 executes in a pipelined fashion and subsequently, at clock signal cycle RF+K−1, completes execution to generate an intermediary (i.e., internal format) result, which is forwarded to at least one of operand registers 114A and 114B via an early result forwarding bus, bus B 124, to serve as a data operand of dependent INSTR 2. While this stage of instruction processing is indicated as occurring at clock signal cycle RF+K−1 for INSTR 1, it is indicated as INSTR 2's initial clock signal cycle, RF. In the same clock signal cycle, INSTR 1's result is packed via pack_2N block 118 and available at the input of result register 120. In an immediately subsequent clock signal cycle (RF+K for INSTR 1), INSTR 1's result is available at the output of result register 120.
In the same clock signal cycle in which INSTR 1 completes and is applied to the result register, data operands (including the forwarded result of the execution of INSTR 1) for INSTR 2 enter the first pipeline stage of execution unit 116. From the perspective of the second, dependent instruction, this clock signal cycle is viewed as cycle RF+1 as depicted in the figure. In the same manner that INSTR 1 was executed, dependent instruction INSTR 2 traverses the pipeline of execution unit 116, arriving at the execution unit's output at clock signal cycle RF+K−1 (RF+2K−2 from the perspective of INSTR 1) and at the output of result register 120 one clock cycle later (at RF+K) as shown. As is apparent from the timing diagram of FIG. 2, utilizing an internal format early result forwarding bus (e.g., bus B 124), a dependent instruction (INSTR 2) may be issued, i.e., applied to an associated execution unit, K−1 cycles after the original instruction (INSTR 1) is issued.
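The cycle relationships described for FIG. 2 may be checked with a small sketch such as the following. The function name `internal_forwarding_timeline` and its dictionary keys are hypothetical labels introduced here for illustration. All cycle numbers are expressed relative to the RF cycle of INSTR 1, and the sketch assumes K is at least 2 so that the first pipeline stage does not follow the execution unit output.

```python
def internal_forwarding_timeline(k: int) -> dict:
    """Cycle numbers, relative to INSTR 1's RF cycle, for INSTR 1 and a
    dependent INSTR 2 whose operand is forwarded on the internal format bus."""
    if k < 2:
        raise ValueError("this sketch assumes K >= 2")
    instr1 = {
        "rf": 0,                 # operands at the operand register inputs
        "first_stage": 1,        # operands enter the execution unit (issue)
        "exec_output": k - 1,    # internal format result, forwarded on bus B
        "result_reg_output": k,  # packed result at the result register output
    }
    # INSTR 2's RF cycle coincides with INSTR 1's cycle RF+K-1.
    offset = instr1["exec_output"]
    instr2 = {stage: cycle + offset for stage, cycle in instr1.items()}
    return {
        "INSTR 1": instr1,
        "INSTR 2": instr2,
        "issue_to_issue": instr2["first_stage"] - instr1["first_stage"],
    }


timeline = internal_forwarding_timeline(4)
```

With K = 4, INSTR 2 reaches the execution unit output at cycle 6, that is, RF+2K−2 from the perspective of INSTR 1, and the issue-to-issue interval evaluates to K−1 = 3.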
FIG. 3 illustrates a timing diagram depicting early result forwarding via an interface operand format bus according to the prior art. In the timing diagram of FIG. 3, result forwarding is accomplished utilizing bus A 126, which is coupled to and accessible by multiple execution units as described previously. As in FIG. 2, FIG. 3 depicts the processing of two successive instructions, INSTR 1 and INSTR 2, where INSTR 2 depends on an execution result of INSTR 1. Similarly to the process previously described, at an initial clock cycle RF, indicated by the left-most timing interval, operands of INSTR 1 are available at the inputs of operand registers 114A and 114B. At an immediately subsequent clock signal cycle (RF+1), INSTR 1's operands enter a first pipeline stage of execution unit 116, executing in a pipelined fashion and subsequently completing execution at clock signal cycle RF+K−1 to generate an intermediary (i.e., internal format) result. In the same clock signal cycle, this intermediary result is packed via pack_2N block 118 and available at the input of result register 120. In an immediately subsequent clock signal cycle (RF+K), the packed result is available at the output of result register 120 and forwarded to at least one of multiplexers 110A and 110B and unpack_2N blocks 112A and 112B via early result forwarding bus A 126, coinciding with the arrival and latching of dependent instruction INSTR 2 within operand registers 114A and 114B. Thus, utilizing bus A 126 to forward results in a 2N-bit wide interface format (with its associated additional packing and unpacking operations), dependent instructions (e.g., INSTR 2) issue K cycles after an associated original instruction (e.g., INSTR 1). The time period necessary between execution unit issuance of dependent instructions is known as “issue to issue” latency in processing element design.
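The effect of the two issue-to-issue latencies on a chain of dependent instructions can be illustrated with the following sketch. The function `dependent_chain_issue_cycles` and the labels "internal" and "interface" are hypothetical names used only for this example. The sketch assumes every instruction in the chain depends on its immediate predecessor and takes "issue" to mean the cycle in which operands enter the first pipeline stage (RF+1 for the first instruction).

```python
from typing import List


def dependent_chain_issue_cycles(k: int, count: int, bus: str) -> List[int]:
    """Issue cycles for `count` back-to-back dependent instructions,
    relative to the RF cycle of the first instruction."""
    if k < 2 or count < 1:
        raise ValueError("this sketch assumes K >= 2 and at least one instruction")
    if bus == "internal":      # private internal format bus (bus B 124)
        latency = k - 1
    elif bus == "interface":   # shared interface format bus (bus A 126)
        latency = k
    else:
        raise ValueError("bus must be 'internal' or 'interface'")
    return [1 + index * latency for index in range(count)]


internal_cycles = dependent_chain_issue_cycles(4, 3, "internal")
interface_cycles = dependent_chain_issue_cycles(4, 3, "interface")
```

For K = 4, the third dependent instruction issues at cycle 7 when forwarding in internal format and at cycle 9 when forwarding in interface format, so the one-cycle difference per dependence accumulates along the chain.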
Another technique for increasing overall instruction throughput in a processing element is vectorization or vector processing. Vector processing, such as the use of single instruction multiple data (SIMD) instructions, exploits data level parallelism by performing the same operation on multiple data elements simultaneously. One example SIMD instruction set extension is the VMX (sometimes referred to as “Altivec”) extension provided by International Business Machines Corporation of Armonk, N.Y. In some implementations, vector instructions are processed by separating a single 2N-bit wide operand into two separate N-bit operands executed utilizing a “half-pumped” execution technique whereby the operands are executed in two successive clock signal cycles, with the two results being concatenated following completion of the second N-bit operand or “slice” to form a complete result. Using such a half-pumped execution technique causes a vector instruction to complete in two clock signal cycles rather than the typical one clock signal cycle required for scalar instruction execution.
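A purely functional sketch of half-pumped execution, ignoring pipeline timing, might look as follows. The helper names `split_slices` and `half_pumped` are hypothetical. The sketch assumes that each N-bit slice is an independent lane whose result is truncated to N bits, that the high-order slice is processed first, and it uses addition merely as an example operation.

```python
from typing import Callable, Tuple


def split_slices(operand: int, n: int) -> Tuple[int, int]:
    """Split a 2N-bit operand into (high-order, low-order) N-bit slices."""
    mask = (1 << n) - 1
    return (operand >> n) & mask, operand & mask


def half_pumped(op: Callable[[int, int], int], a: int, b: int, n: int) -> int:
    """Apply `op` to each N-bit slice in turn and concatenate the two results."""
    mask = (1 << n) - 1
    a_high, a_low = split_slices(a, n)
    b_high, b_low = split_slices(b, n)
    high_result = op(a_high, b_high) & mask  # first clock signal cycle
    low_result = op(a_low, b_low) & mask     # second clock signal cycle
    return (high_result << n) | low_result   # concatenated 2N-bit result


result = half_pumped(lambda x, y: x + y, 0x01FF, 0x0101, 8)
```

With N = 8, the low-order slices 0xFF and 0x01 sum to 0x00 after truncation and the high-order slices 0x01 and 0x01 sum to 0x02, giving 0x0200. No carry propagates between the slices, reflecting the assumption that they are independent lanes.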
FIG. 4 depicts the processing element of FIG. 1 extended to support half-pumped execution of vector (SIMD) words, where the SIMD words or slices each have half the width of the full data width 2N. The illustrated processing element operates in a substantially similar manner to that depicted in FIG. 1. Data of each of two separate instruction operands is retrieved or applied from a register file (not shown) to each of two corresponding multiplexers 410A and 410B. Scalar 2N-bit wide operands so received and selected utilizing multiplexers 410A and 410B are applied to corresponding unpack_2N blocks 412A and 412B, which convert the received data from interface to internal format utilized in operand processing by execution unit 422. Once converted, internal format operands are stored in corresponding operand registers 420A and 420B as shown which, in the illustrated embodiment, are implemented as multiplexer (MUX) latches as described herein. A corresponding pack_2N block 424 converts the scalar execution result data from internal to 2N-bit wide interface format. Scalar results generated by execution unit 422 may then be distributed across N-bit result registers 428A and 428B from which a concatenated result, upon selection utilizing a global result multiplexer 430, may be reapplied to the register file as instruction processing completes and/or applied to the operand registers of the same or another processing element as shown. Result forwarding may be implemented via either of bus B 432 (in internal format) or bus A 434 (in interface format) as previously described with respect to FIG. 1.
Vector instructions are handled by the processing element of FIG. 4 utilizing a half-pumped execution technique as will now be described. As each operand associated with a vector (e.g., SIMD) instruction is received at multiplexers 410A and 410B, it is applied to additional 2N-bit to N-bit selection multiplexers 414A and 414B as well as temporary registers 416A and 416B rather than to unpack_2N blocks 412A and 412B. Multiplexers 414A and 414B are utilized to select which portion or “slice” of the vector instruction will be processed first. In the embodiment of FIG. 4, a big-endian architecture is presumed and the most-significant or “high order” operand slices represented by bits 0 . . . N−1 of each operand are processed first and applied to unpack_N blocks 418A and 418B, which convert the received data from an N-bit wide external or “interface” format into an internal format. After the first vector operand slice is processed as described, each 2N-bit wide interface formatted operand is applied, from corresponding temporary registers 416A and 416B, via associated multiplexers 410A and 410B back to the inputs of multiplexers 414A and 414B. At the second application of each operand, however, multiplexers 414A and 414B are utilized to select the least significant or “low order” operand slices represented by bits N . . . 2N−1 for unpacking and operand register storage.
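Because the big-endian convention numbers bit 0 as the most significant bit, the slice selection described above can be easy to misread. The following sketch, offered only as an illustration, extracts the two slices using that numbering. The helpers `be_bits` and `operand_slices` are hypothetical names, and the `held` variable merely stands in for the role of the temporary register.

```python
from typing import List


def be_bits(value: int, width: int, first: int, last: int) -> int:
    """Extract bits first..last (inclusive) of a `width`-bit value,
    where bit 0 denotes the most significant bit."""
    if not 0 <= first <= last < width:
        raise ValueError("bit range must lie within the operand width")
    count = last - first + 1
    shift = width - 1 - last  # distance of bit `last` from the LSB
    return (value >> shift) & ((1 << count) - 1)


def operand_slices(operand: int, n: int) -> List[int]:
    """Return the slices of a 2N-bit operand in processing order."""
    held = operand  # copy retained for the second pass
    high = be_bits(operand, 2 * n, 0, n - 1)   # first pass: bits 0 . . . N-1
    low = be_bits(held, 2 * n, n, 2 * n - 1)   # second pass: bits N . . . 2N-1
    return [high, low]


slices = operand_slices(0xABCD, 8)
```

For the 16-bit operand 0xABCD with N = 8, the first slice selected is 0xAB (bits 0 through 7) and the second is 0xCD (bits 8 through 15).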
Using the described half-pumped execution technique, vector slices are then applied to execution unit 422 for execution. Execution results produced by execution unit 422 are then packed using pack_N block 426 in consecutive clock cycles. Consequently, the higher order half of each result (e.g., result[0:N−1]) is available at the output of result register 428A in clock signal cycle K. The other (lower order) half (e.g., result[N:2N−1]) is available at the output of the other result register 428B in cycle K+1. The complete 2N-bit wide result of the instruction is concatenated from the two separate result registers and is available via global result multiplexer 430 on bus A 434 in cycle K+1. The progression of vector instruction operands through the processing element of FIG. 4, including the use of result forwarding buses 432 and 434, may be better appreciated when read in conjunction with the description of FIGS. 5 and 6.
FIG. 5 illustrates a timing diagram depicting early result forwarding of vector instruction slice results via an internal format bus according to the prior art. Different SIMD slices per instruction are depicted using different shading patterns. In the embodiment of FIG. 5, a striped pattern block represents a high order SIMD slice [0:N−1] and a crossed pattern block represents a low order SIMD slice [N:2N−1]. More specifically, FIG. 5 illustrates vector instruction execution with an issue-to-issue latency interval of K+1 cycles, where the result of the first instruction INSTR 1 is sent in an external interface register file format via bus A 434 to be utilized in the execution of a dependent second instruction, INSTR 2. FIG. 6 illustrates a timing diagram depicting early result forwarding of vector instruction slice results via an interface format bus according to the prior art. Per the timing diagram shown, the data path of FIG. 4 supports an issue-to-issue interval of K−1 cycles using bus B in two subsequent cycles to forward internal format result slices at the conclusion of half-pumped execution.
Modern processing element designs, however, must also support an issue-to-issue interval of K cycles if the smallest issue-to-issue latency is K−1 cycles, to avoid increased instruction sequencer complexity. Interface format bus A 434 of FIG. 4 is only available in a clock signal cycle when a generated result is sent to an associated register file and therefore cannot be used in consecutive clock cycles to achieve K-cycle issue-to-issue latency. Conventional processing elements may therefore either provide for K+1 cycle issue-to-issue latency alone or include an additional result forwarding bus dedicated to supporting K-cycle latency. The former solution suffers from reduced performance, while the latter adds complexity and power consumption to a design and is feasible only if the requisite wiring resources and physical real estate are available.