The present embodiments relate to electronic processors such as microprocessors or digital signal processors (“DSPs”), and are more particularly directed to improving performance and efficiency through selectively forwarding data from forwarding multiplexers to no more than half of the sub-units in the processor functional unit(s).
Electronic processors including microprocessors and DSPs have become prevalent in numerous applications, including uses for devices in personal, business, and other environments. Demands of the marketplace affect many aspects of processor design, including factors such as device power consumption and speed. As a result, constant evaluation is performed by processor designers in an effort to provide improvements in these and other factors. The present embodiments also endeavor toward this end.
By way of additional background, FIG. 1 illustrates a block diagram of a prior art processor 10. For the sake of simplifying the illustration and following discussion, only certain aspects of processor 10 are shown, as providing a basis by which to appreciate the contrast between the prior art and the preferred embodiments described later. Thus, any simplification in FIG. 1 is not by way of limitation, and one skilled in the art will readily appreciate that processors typically include numerous other attributes. Turning specifically to processor 10, it includes what is referred to herein and often in the art as a functional unit 12. As also discussed later, contemporary processors may include multiple functional units in the same general configuration as the single functional unit 12 shown in FIG. 1, but the illustration of only a single such unit should suffice to provide an adequate discussion. Functional unit 12 includes what are referred to herein as sub-units 14x, and in the illustration these include four sub-units 141, 142, 143, and 144. As known in the art, processor 10 has an associated instruction set, and each of sub-units 14x operates to perform data operations with respect to one or more data arguments according to any one of the instructions in that instruction set. FIG. 1 illustrates various examples in this regard. First, sub-unit 141 is shown as an arithmetic logic unit (“ALU”). Thus, sub-unit 141 is operable, among other things, to perform an add operation corresponding to the add instruction in the instruction set. Second, FIG. 1 illustrates sub-unit 142 to be a logical operator block that is operable to perform various logical operations, such as are known in the art to include by way of example logical AND, OR, and so forth. Third, sub-unit 143 is shown as a shifter unit.
Thus, sub-unit 143 is operable to receive as one argument an input word, and sub-unit 143 shifts that data word a number of bit locations, where the number of bit shift locations is provided by another argument input to sub-unit 143; additionally, in some instances a shifter is operable to perform other two-argument operations. Fourth, sub-unit 144 is shown as a normalizer. Thus, sub-unit 144 is operable to identify the number of leading zeroes or ones in the one word input argument received by sub-unit 144.
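By way of illustration only, the four example sub-units described above may be sketched in software as follows. This is a minimal behavioral sketch and not a description of any circuit in FIG. 1. The 32-bit word width, the choice of a left shift, and the function names alu_add, logic_and, shift_left, and normalize are assumptions made solely for the example.

```python
WORD_BITS = 32                  # assumed word width; the text does not specify one
MASK = (1 << WORD_BITS) - 1     # keeps every result within one word


def alu_add(a, b):
    """ALU example: add two arguments, wrapping at the assumed word width."""
    return (a + b) & MASK


def logic_and(a, b):
    """Logical operator block example: bitwise AND of two arguments."""
    return a & b & MASK


def shift_left(word, count):
    """Shifter example: shift the input word by the number of bit locations
    given in the second argument (a left shift is assumed here)."""
    return (word << count) & MASK


def normalize(word):
    """Normalizer example: count the leading bits that equal the most
    significant bit, i.e. the number of leading zeroes or leading ones."""
    word &= MASK
    msb = (word >> (WORD_BITS - 1)) & 1
    count = 0
    for bit in range(WORD_BITS - 1, -1, -1):
        if (word >> bit) & 1 != msb:
            break
        count += 1
    return count
```

Note that the first three functions take two arguments while normalize takes one, which mirrors the distinction drawn above between two-argument sub-units and the single-argument normalizer.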
Looking now to the connections in processor 10 and beginning in functional unit 12, it includes two forwarding multiplexers 16A and 16B, which are named as such for reasons detailed later. For the sake of reference, let the data provided from the output of forwarding multiplexer 16A be indicated as “A” and the data provided from the output of forwarding multiplexer 16B be indicated as “B,” as also represented by the subscript for each forwarding multiplexer. A control input is connected to forwarding multiplexer 16A so that prior to an execution cycle of functional unit 12, a signal FWD_A at that control input causes multiplexer 16A to select one of its inputs, which in the example of FIG. 1 consist of two inputs, and connect the data at the selected input to its output. Similarly, a control input is connected to forwarding multiplexer 16B so that prior to an execution cycle of functional unit 12, a signal FWD_B at that control input causes multiplexer 16B to select one of its inputs, which in the example of FIG. 1 consist of two inputs, and connect the data at the selected input to its output. The respective data outputs, A and B, of forwarding multiplexers 16A and 16B, are connected to data input registers 18A and 18B, respectively, where data input registers 18A and 18B operate according to known principles whereby an enabling clock signal, CLK, causes the data at the input of each such register to be stored in the register and to be presented at the register output.
The outputs of data input registers 18A and 18B are connected to various different ones of the sub-units 141 through 144, where the specific connections are typically based on the number and/or type of argument(s) required by each sub-unit. For example, with respect to sub-unit 141 and as shown as an example to be an ALU, such an ALU typically requires two arguments. Thus, one of these arguments is data A provided by data input register 18A and the other of these arguments is data B provided by data input register 18B. Similarly, each of sub-units 142 and 143, as shown in the respective examples as a logic operator block and a shifter, also requires two data arguments; thus, each such sub-unit is shown to receive one argument from each of data input registers 18A and 18B. In addition and as also illustrated, sub-unit 143, for the example where it is a shifter, also may receive the data A as an additional argument to indicate the number of shifts to be achieved. Lastly, sub-unit 144 in the example illustrated requires only a single input argument. In processor 10, this single argument is provided to sub-unit 144 from data input register 18B. The output of each sub-unit 141 through 144 is connected as an input to an output multiplexer 20; further, and for reasons discussed later, the output of output multiplexer 20 is connected as an input A1 to forwarding multiplexer 16A and as an input B1 to forwarding multiplexer 16B. The output of output multiplexer 20 is also connected to an input of a register 21, and the output of register 21 is connected to an input to a register file 22. Register file 22 typically has a read port for each forwarding multiplexer and, thus, in the illustrated example, register file 22 has a read port 221 connected to an input A2 of forwarding multiplexer 16A and a read port 222 connected to an input B2 of forwarding multiplexer 16B.
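The selection performed by a forwarding multiplexer and the capture performed by a data input register, as described above, may be sketched as follows. This is an illustrative model only. The encoding of the select value (1 for input A1 or B1, 2 for input A2 or B2) and the names forwarding_mux and DataInputRegister are hypothetical, since the text does not state how FWD_A and FWD_B are encoded.

```python
def forwarding_mux(select, input_1, input_2):
    """Connect one of two inputs to the output.

    select is the FWD_A or FWD_B control value; the encoding is assumed:
    1 chooses input A1/B1 (fed from output multiplexer 20),
    2 chooses input A2/B2 (fed from a read port of register file 22).
    """
    if select == 1:
        return input_1
    if select == 2:
        return input_2
    raise ValueError("select must be 1 or 2 in this sketch")


class DataInputRegister:
    """Model of a data input register such as 18A or 18B."""

    def __init__(self):
        self.output = 0  # value currently presented at the register output

    def clock(self, data_in):
        """On the enabling CLK, store the input and present it at the output."""
        self.output = data_in
        return self.output
```

For example, calling reg_a.clock(forwarding_mux(2, 7, 9)) on an instance reg_a of DataInputRegister would capture the value 9 supplied on the second input, corresponding to a read from the register file.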
Lastly, processor 10 may include multiple functional units in the same general configuration as the single functional unit 12 shown in FIG. 1. In such a case, for example where there are two functional units, typically the outputs of the sub-units from a first functional unit are also connected to forwarding multiplexers in the second functional unit, and vice versa. Accordingly, while functional unit 12 is illustrated as a single functional unit and its forwarding multiplexers have two inputs (either A1 and A2 or B1 and B2), where additional functional units are included each forwarding multiplexer may include additional inputs. In any event, the present illustration indicates only a single functional unit 12 so as to simplify the remaining discussion.
The operation of processor 10 is now described and further in view of the timing diagram of FIG. 2 and a few pseudo instruction examples. Looking first to a pseudo instruction example, consider the case that register file 22 includes at least six registers R0 through R5, and assume that the following two pseudo instructions are to be executed:
ADD R0,R1→R2  Instr. 1
ADD R3,R4→R5  Instr. 2
One skilled in the art will recognize that instruction 1 adds the contents of registers R0 and R1 and stores the result to register R2, and instruction 2 adds the contents of registers R3 and R4 and stores the result to register R5.
Looking now to the timing diagram of FIG. 2, the operation of processor 10 relative to instruction 1 is now explored in greater detail. By way of introduction, in the preferred embodiment each instruction requires multiple clock cycles to perform the instruction and preferably a new instruction commences for each clock cycle such that the time to fully complete a first instruction overlaps with one or more clock cycles of completing a second instruction that immediately follows the first instruction; in this way, the instructions are said in the art to be pipelined and therefore FIG. 2 is sometimes referred to in the art as a pipeline timing diagram. Specifically looking to FIG. 2, it illustrates three clock cycles of operation with respect to each of instructions 1 and 2, where as shown below a portion of those operations overlap in time, thereby illustrating the pipelining as introduced above. For reasons explored later, FIG. 2 also indicates the relative timing of two additional instructions 3 and 4, which generally follow the same timing as instructions 1 and 2, but which provide for different observations made later. Looking first to the case of instructions 1 and 2, at time t0, instruction 1 reads its arguments R0 and R1; particularly, the FWD_A signal is set such that forwarding multiplexer 16A reads one of these arguments as data A from read port 221 of register file 22, and at the same time the FWD_B signal is set such that forwarding multiplexer 16B reads the other of these arguments as data B from read port 222 of register file 22. By the end of the read cycle, that is, by time t1, these two arguments are clocked into data input registers 18A and 18B, respectively. Continuing with instruction 1, from time t1 to time t2, sub-unit 141 executes the intended instruction 1, which is an ADD.
Thus, by the completion of the instruction 1 execute cycle at time t2, the sum produced by the ADD is available at the output of sub-unit 141, selected by output multiplexer 20, and also is latched into register 21. Finally, during the write cycle of instruction 1 as completed by time t3, the value in register 21 (i.e., the sum from the ADD instruction) is stored into register R2 of register file 22.
The operation of processor 10 relative to instruction 2 is now explored in greater detail and, as introduced above, is shown to overlap in part the operation relative to instruction 1. At time t1, instruction 2 reads its arguments R3 and R4, where like instruction 1 this is achieved by setting FWD_A and FWD_B so that forwarding multiplexers 16A and 16B read respective arguments data A and data B from read ports 221 and 222 of register file 22, and those arguments are clocked into data input registers 18A and 18B, respectively. Note also that this read cycle coincides with the execute cycle of instruction 1, that is, the access of data for instruction 2 is occurring at the same time as is the processing (e.g., adding) of data for instruction 1. Next, from time t2 to time t3, sub-unit 141 executes the intended instruction 2 which again is an ADD; here, however, the add is with respect to the instruction 2 arguments from registers R3 and R4. Thus, by the completion of the instruction 2 execute cycle at time t3, the sum produced by the ADD is available at the output of sub-unit 141, selected by output multiplexer 20, and latched into register 21. Finally, during the write cycle of instruction 2 as completed by time t4, the sum from register 21 is stored into register R5 of register file 22.
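The overlapping read, execute, and write cycles described above for instructions 1 and 2 may be tabulated with a short sketch such as the following. It assumes, consistent with the description, that one instruction commences per clock cycle and that each instruction occupies three consecutive cycles. The function name schedule and the tuple representation of times are hypothetical.

```python
STAGES = ("read", "execute", "write")


def schedule(num_instructions):
    """Return {instruction number: {stage: (start, end)}}.

    Times are indices into t0, t1, t2, ... so (1, 2) means t1 to t2.
    Instruction numbering starts at 1 and one instruction starts per cycle.
    """
    table = {}
    for i in range(num_instructions):
        table[i + 1] = {
            stage: (i + offset, i + offset + 1)
            for offset, stage in enumerate(STAGES)
        }
    return table


if __name__ == "__main__":
    for number, stages in schedule(2).items():
        for stage, (start, end) in stages.items():
            print(f"Instr. {number} {stage}: t{start} to t{end}")
```

In the resulting table the execute cycle of instruction 1 and the read cycle of instruction 2 both span t1 to t2, which is the overlap noted above.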
Additional operation of processor 10 is now described, with an example that presents the basis for the reference of multiplexers 16A and 16B as “forwarding” multiplexers. Particularly, consider now an example where the following two pseudo instructions are to be executed:
ADD R0,R1→R2  Instr. 3
ADD R2,R4→R5  Instr. 4
Instruction 3 is the same as instruction 1 and, hence, the reader is assumed familiar with the level of detail described above. Instruction 4, however, adds the contents of registers R2 and R4 and stores the result to register R5, where note therefore that the argument R2 in instruction 4 requires the result of the immediately-preceding instruction 3. This requirement is sometimes referred to as a hazard, for reasons further detailed below.
Looking again to the timing diagram of FIG. 2 and now with respect to instructions 3 and 4, instruction 3 proceeds in the same manner as described above with respect to instruction 1. Looking then to instruction 4, between time t1 and time t2 it requires its arguments, which are to be from registers R2 and R4. However, note that as of time t2, instruction 3 has not yet written its result into register file 22. Thus, this result, as the needed argument for instruction 4, is not available from register file 22. This contingency is sometimes referred to as a hazard because, without additional consideration taken, instruction 4 could read from register R2 a stale value that instruction 3 has not yet updated. Given this possibility, recall that the output of each sub-unit 14x is not only connected as an input to multiplexer 20, but it is also returned as an input to each forwarding multiplexer 16x. With this added connection and continuing with FIG. 2 and instruction 4, from time t1 to time t2, forwarding multiplexer 16A “forwards” the output of sub-unit 141 directly back to data input register 18A, from where instruction 4 reads the argument; thus, the result of instruction 3, which is to be written to register R2 from time t2 to time t3, is before time t2 forwarded and read by instruction 4. It is for this reason that multiplexer 16A is termed a forwarding multiplexer, and this reasoning applies likewise to the similar connectivity and operability of forwarding multiplexer 16B. Lastly, completing the operation of processor 10 with respect to instruction 4, from time t3 to time t4, the result of instruction 4 is written, via multiplexer 20, into register R5 of register file 22.
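The hazard and the effect of forwarding described above may be illustrated by a small cycle-by-cycle sketch. It is a simplified software model under stated assumptions, not a description of the circuitry of FIG. 1: only ADD instructions are modeled, words are assumed to be 32 bits, a register file write is assumed to be visible to a read in the same cycle, and the function name run and the tuple encoding of instructions are hypothetical.

```python
def run(program, regs, forwarding=True):
    """Execute ADD instructions (src_a, src_b, dest), one issued per cycle.

    Returns the register contents after every instruction has written back.
    """
    regs = dict(regs)
    in_regs = None  # (a, b, dest) held in data input registers 18A and 18B
    reg21 = None    # (value, dest) held in register 21
    for cycle in range(len(program) + 2):
        # Write cycle: register 21 is stored into the register file.
        # Assumption: this write is visible to a read in the same cycle.
        if reg21 is not None:
            value, dest = reg21
            regs[dest] = value
        # Execute cycle: the ALU adds the arguments captured one cycle ago.
        alu_out = None
        if in_regs is not None:
            a, b, dest = in_regs
            alu_out = ((a + b) & 0xFFFFFFFF, dest)
        # Read cycle: each forwarding multiplexer picks either the result
        # being produced now or a read port of the register file.
        next_in = None
        if cycle < len(program):
            src_a, src_b, dest = program[cycle]

            def pick(src):
                if forwarding and alu_out is not None and alu_out[1] == src:
                    return alu_out[0]  # forwarded result
                return regs[src]       # register file read port

            next_in = (pick(src_a), pick(src_b), dest)
        # Clock edge: results and arguments are latched for the next cycle.
        reg21 = alu_out
        in_regs = next_in
    return regs


if __name__ == "__main__":
    initial = {"R0": 1, "R1": 2, "R2": 100, "R3": 0, "R4": 10, "R5": 0}
    instr_3_and_4 = [("R0", "R1", "R2"), ("R2", "R4", "R5")]
    print(run(instr_3_and_4, initial, forwarding=True)["R5"])   # uses new R2
    print(run(instr_3_and_4, initial, forwarding=False)["R5"])  # uses stale R2
```

With the arbitrary initial values chosen here, instruction 4 computes 3 + 10 when the result of instruction 3 is forwarded, whereas with forwarding disabled it computes 100 + 10 from the stale contents of register R2.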
While the preceding structure and method of operation for processor 10 have proven useful and beneficial in many devices and applications, the present inventor has recognized that they have various limitations and drawbacks. More particularly, as stated earlier, factors such as device power consumption and speed are critical in contemporary processor design. Toward this end, the present inventor has recognized that these parameters in connection with processor 10 may be improved upon, as is provided by the preferred embodiments described below.