Digital processing of analog signals is critical to many important commercial applications, including such diverse fields as telecommunication networks, audio and video presentation devices, and computer controlled systems. Such applications typically utilize classic time-invariant algorithms, such as digital filtering and Fourier transforms. Although differing in their implementation details, these algorithms share a common characteristic: dependence upon a basic mathematical operation—the multiply and accumulate (“MAC”). In a “MAC operation”, a first data operand is multiplied by a second data operand, and the product is added to the current contents of an “accumulator”. In most such applications, the speed with which a MAC operation is performed is considered critical.
If the data operands are themselves simply elements of data operand “vectors”, as is often the case, each MAC operation requires pre-loading of an appropriate pair of operands using respective access address “pointers” into the data vectors, and then post-modification of each of the pointers according to a specific address access pattern. Typically, the access patterns are different for each of the data vectors. In some applications, one (or both) of the data vectors may be too large to fit into available system memory at one time, thus requiring further overhead to move each over-sized vector through a conveniently sized “buffer” which is allocated in either system or local memory. In general, each buffer is specified in terms of a starting “base address” and a “modulo” length, and the operands in that buffer are accessed according to an access pattern having a particular step “offset” size. In many algorithms, at least one of the buffers is accessed in a modulo manner, wherein a pointer that steps beyond the end of the buffer is wrapped, modulo the length of the buffer, back into the buffer. For the purpose of the description that follows, I will use the term “circular buffer” to refer to any memory-based data buffer which is accessed in such a modulo manner, regardless of whether or not the size of the buffer is less than or equal to the size of the data vector which may be stored therein.
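By way of illustration only, the modulo access pattern described above may be sketched in software as follows. The class name `CircularPointer`, the function `mac_loop`, and the sample memory contents are hypothetical and are assumed solely for this example. The sketch models each operand pointer by its base address, modulo length, and step offset, and performs one MAC operation per pair of post-modified accesses.

```python
from dataclasses import dataclass


@dataclass
class CircularPointer:
    """Hypothetical model of one operand pointer into a circular buffer."""
    base: int       # starting base address of the buffer
    length: int     # modulo length of the buffer
    offset: int     # step size applied after each access
    index: int = 0  # current position, relative to base

    def post_modify(self) -> int:
        """Return the current address, then step and wrap modulo the length."""
        address = self.base + self.index
        self.index = (self.index + self.offset) % self.length
        return address


def mac_loop(memory, ptr_a, ptr_b, count):
    """Accumulate the products of `count` operand pairs fetched via the pointers."""
    acc = 0
    for _ in range(count):
        a = memory[ptr_a.post_modify()]  # pre-load first operand, post-modify pointer
        b = memory[ptr_b.post_modify()]  # pre-load second operand, post-modify pointer
        acc += a * b                     # multiply and accumulate
    return acc


# Assumed layout: a 4-entry buffer at address 0 and a 3-entry buffer at address 4.
memory = [1, 2, 3, 4, 10, 20, 30]
coeffs = CircularPointer(base=0, length=4, offset=1)
samples = CircularPointer(base=4, length=3, offset=2)
result = mac_loop(memory, coeffs, samples, 4)
```

In this sketch the second pointer steps by two through a three-entry buffer, so its third and fourth accesses wrap back into the buffer rather than running past its end. The two pointers use different lengths and offsets to reflect the observation that the access patterns typically differ for each data vector.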
In general, it is the presence of an execution unit (“EU”) especially designed to efficiently perform an atomic MAC operation that distinguishes a digital signal processor (“DSP”) from a general purpose digital data processor. In view of the importance of timely supplying the MAC EU with operands, many DSPs incorporate a pair of special purpose data address generators (“DAGs”) to assist the load/store unit (“LSU”) in supplying operands to the MAC EU. In such DSPs, a single atomic “MAC instruction” may be provided to allow a programmer to specify both the details of the MAC operation and, via special purpose registers, the characteristics of each of the operand access patterns.
It has occurred to me that application of conventional microprocessor design concepts to DSPs should prove beneficial for numerous reasons. First, the majority of DSP algorithms involve loops. Second, DSP algorithms tend to be computationally intensive. Third, DSP application code is usually relatively small, with relatively few conditional branches, thus reducing the control logic required for branch prediction. Fourth, many modern DSPs have dedicated hardware for loop operations. Finally, the results of such operations are often only interim results which are consumed within the loop and never used again, thus reducing register pressure and traffic through the LSU.
Shown in FIG. 1 is a 5-stage pipeline typical of a simple in-order issue microprocessor or microcontroller. In such a pipeline, all instructions traverse all stages and may encounter an “interlock” or “stall” at each stage boundary, depending upon a number of conditions, including, for example, the state of the instructions further down the pipeline, current unavailability of necessary hardware resources, or register conflicts (typically referred to as “hazards”). Shown in FIG. 2 is a deeper 7-stage pipeline typical of a classic out-of-order (“OoO”) issue microprocessor, such as the “Alpha”, originally designed by engineers working for the Digital Equipment Corporation. In this pipeline, after the OoO Issue stage, multiple instructions may be “in flight”, but regardless of the actual time of “completion”, all will be “retired” in strict program order. In such a machine, the LSU can be represented as a “side pipe” which is “tightly coupled” to the main pipe, in that its operation is fully interlocked with respect to the main pipeline (illustrated in FIG. 2 by a heavy dashed line between stages 4 and 5 of the main pipe flow). Thus, although most instructions experience only Seven (7) stages of “latency”, loads and stores have a latency of Nine (9) stages.
In general, any pipe stage capable of generating a “synchronous exception” (including both “traps”, such as an overflow trap, and “faults”, such as a page fault) must be interlocked. For convenience of reference, I shall hereafter refer to such stages as “interlockable stages”. One disadvantage of interlocking is that, once a particular interlockable stage is interlocked, instructions traversing the preceding stages in the pipeline will back up pending release of the interlock. One partial solution is, as in the Alpha, to provide an OoO issue mechanism. However, the amount of hardware necessary to support OoO operation is quite substantial, and, in any event, is still constrained to a limited number of instructions simultaneously in-flight in the EUs. In effect, the bottleneck is simply shifted (at significant expense in hardware, power, etc.) from the issue stage to a common write-back/retire stage. It has occurred to me, however, that an instruction can be defined such that, after a certain point in the pipe, it is, by definition, incapable of generating synchronous exceptions. It is therefore theoretically possible to “retire” such an instruction any time after that instruction has traversed the last stage in the pipe at which that instruction can provoke an interlock! By “decoupling” such an instruction from the strict in-order retirement portion of the pipeline, subsequent instructions in program order can also be retired before the decoupled instruction has actually completed execution. Of course, if a decoupled instruction is such that it will ultimately deliver a result back to an architectural register, then the register allocation and update mechanism must make appropriate allowances to detect and resolve data hazards (including “read after write”, “write after read”, and “write after write”). 
In general, however, the write-back time slot allocation mechanism should not need significant modification since the total amount of traffic and the number of competing sources will still be the same. What is really needed is a mechanism for “decoupling” an instruction such that it is eligible for retirement as soon as it is no longer capable of generating any synchronous exceptions.
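The retirement-eligibility condition stated above can be made concrete with a minimal sketch. The function name `earliest_retire_stage` and the per-stage flags below are hypothetical and assumed only for this example; they do not describe any particular pipeline. Each instruction is represented by a list of flags, one per pipe stage, indicating whether that instruction is capable of generating a synchronous exception in that stage.

```python
def earliest_retire_stage(can_fault):
    """Return how many stages an instruction must traverse before it can no
    longer generate a synchronous exception.

    `can_fault[i]` is True if stage i may raise a trap or fault for this
    instruction (an assumed, per-instruction property).
    """
    last_interlockable = -1
    for stage, may_fault in enumerate(can_fault):
        if may_fault:
            last_interlockable = stage
    return last_interlockable + 1


# Assumed 7-stage pipe. A conventional instruction may fault up to the last stage.
conventional = [True, True, True, True, True, True, True]
# A hypothetical decoupled instruction may fault only in the first four stages.
decoupled = [True, True, True, True, False, False, False]

conventional_point = earliest_retire_stage(conventional)
decoupled_point = earliest_retire_stage(decoupled)
```

Under these assumed flags, the conventional instruction becomes eligible for retirement only after all seven stages, whereas the decoupled instruction becomes eligible after four, leaving its remaining stages free to complete without holding up retirement of subsequent instructions.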
For the purpose of making relative performance comparisons in the description that follows, I shall estimate circuit performance in terms of “units of delay”, wherein I define One (1) unit of delay as the time required for an input signal to traverse a typical 3-input NAND gate and settle to the correct output logic level at the input of the next level of logic. I will assume that such a typical gate would be implemented, using current state of the art semiconductor manufacturing processes, as a single, contiguous physical unit or cell of minimal sized transistors with minimum inter-transistor wiring. In all estimates that I shall make herein, I will also assume that, within each discrete functional unit, such as an adder, all requisite gates comprise a single, contiguous physical unit or super-cell so as to minimize inter-gate wiring.
In the field of general purpose digital data processors, it has been demonstrated that considerable improvement in performance can be achieved by employing a very deep pipeline, on the order of Twelve (12) stages or more, and increasing the clock rate accordingly. In general, the clock rate is determined by the pipeline stage having the longest critical speed path. In high performance processors, careful attention is given to partitioning the pipeline so as to balance the relative speed paths through each stage. A significant imbalance may indicate the desirability of splitting that stage into multiple stages or of augmenting that stage with additional hardware resources. In either case, the consequences on relative cost to performance must be considered.
In a modern deeply pipelined microprocessor, such as the Alpha, the clock-cycle-limiting pipe stage is considered to consist of an input latch, a minimum arithmetic logic unit (“ALU”) operation, and result forwarding back to the input latch, requiring about Eleven (11) delay units using current state of the art design techniques. Such a design allows single-cycle ALU forwarding, while achieving high clock frequency rates. It is also close to the minimum time required to drive and sample a state of the art memory array, such as a 64×64 static random access memory (“SRAM”) array.
In modern DSPs, the longest stage of the processing “pipeline” is the single-cycle MAC EU, and the clock rate is set accordingly. Shown in FIG. 3 is a 6-stage pipeline typical of such a conventional DSP, wherein the MAC can be represented as a tightly-coupled side pipe. Using current state of the art logic design, the critical speed path through a MAC EU is approximately Forty (40) delay units. Using a state of the art 0.18 micron manufacturing process, One (1) delay unit is approximately One Hundred (100) picoseconds. Thus, the maximum clock rate for such a design would be on the order of Two Hundred Fifty (250) MHz. In contrast, the critical speed path through a current state of the art DAG is approximately Twenty (20) delay units. Since the DAG is already twice as fast as it needs to be to keep up with the MAC EU, there has been little incentive to improve its performance, particularly since such improvement would come only at the cost of additional hardware, power consumption, waste heat, etc.
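The arithmetic behind these clock rate estimates can be checked with a short sketch. The function name `max_clock_mhz` is hypothetical. The figure of 100 picoseconds per delay unit is the approximate value given above for a 0.18 micron process, and the 11-delay-unit case is the ALU stage estimate given earlier for a deeply pipelined microprocessor.

```python
def max_clock_mhz(delay_units, ps_per_unit=100.0):
    """Approximate maximum clock rate, in MHz, for a stage whose critical
    speed path is `delay_units` long, at `ps_per_unit` picoseconds per unit."""
    period_ps = delay_units * ps_per_unit  # minimum clock period in picoseconds
    return 1e6 / period_ps                 # 1 MHz corresponds to a 1e6 ps period


mac_mhz = max_clock_mhz(40)  # single-cycle MAC EU
dag_mhz = max_clock_mhz(20)  # conventional DAG
alu_mhz = max_clock_mhz(11)  # deeply pipelined ALU stage
```

With these inputs the MAC EU limits the clock to about 250 MHz and the DAG could sustain about 500 MHz, while an 11-delay-unit stage would correspond to roughly 909 MHz. These are order-of-magnitude estimates only, since the per-unit delay is itself approximate.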
If contemporary deep pipelining design techniques could be effectively applied to the MAC in a DSP, one might expect to realize commensurate improvement in system performance. However, deeply pipelining the MAC alone is not sufficient to achieve the desired 11-delay-unit clock cycle: once the MAC is pipelined, the clock-cycle-limiting stage becomes the DAG, with its critical speed path of approximately Twenty (20) delay units. I have shown a DAG capable of sustained operation at an 11-delay-unit clock cycle in my co-pending U.S. application Ser. No. 90/537,020, entitled “Modulo Address Generation Method and Apparatus”, filed contemporaneously herewith and incorporated herein by reference (“Co-pending Application”). Accordingly, a need now exists for a deeply pipelined MAC, the longest stage of which has a critical speed path on the order of Eleven (11) delay units.