The present invention relates generally to data address generation in a digital data processing system, and, in particular, to a data address generator which generates modulo addresses for addressing data operands stored in a circular buffer.
Digital processing of analog signals is critical to many important commercial applications, including such diverse fields as telecommunication networks, audio and video presentation devices, and computer controlled systems. Such applications typically utilize classic time-invariant algorithms, such as digital filtering and Fourier transforms. Although differing in their implementation details, these algorithms share a common characteristic: dependence upon a basic mathematical operation, the multiply and accumulate ("MAC"). In a "MAC operation", a first data operand is multiplied by a second data operand, and the product is added to the current contents of an "accumulator". In most such applications, the speed with which a MAC operation is performed is considered critical.
If the data operands are themselves simply elements of data operand "vectors", as is often the case, each MAC operation requires pre-loading of an appropriate pair of operands using respective access address "pointers" into the data vectors, and then post-modification of each of the pointers according to a specific address access pattern. Typically, the access patterns are different for each of the data vectors. In some applications, one (or both) of the data vectors may be too large to fit into available system memory at one time, thus requiring further overhead to move each over-sized vector through a conveniently sized "buffer" which is allocated in either system or local memory. In general, each buffer is specified in terms of a starting "base address" and a "modulo" length, and the operands in that buffer are accessed according to an access pattern having a particular step "offset" size. In many algorithms, at least one of the buffers is accessed in a modulo manner, wherein a pointer that steps beyond the end of the buffer is wrapped, modulo the length of the buffer, back into the buffer. For the purpose of the description that follows, I will use the term "circular buffer" to refer to any memory-based data buffer which is accessed in such a modulo manner, regardless of whether or not the size of the buffer is less than or equal to the size of the data vector which may be stored therein.
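By way of illustration only, a MAC loop of the kind described above can be sketched in Python as follows. The function name mac_loop and its parameters are hypothetical, the circular buffer is modeled as a Python list whose base address is taken to be index 0, and the coefficient vector is assumed to be accessed with a simple step of one.

```python
def mac_loop(coeffs, buffer, start, step):
    """Accumulate coeffs[p] * buffer[q], post-modifying both pointers.

    Hypothetical model: `buffer` is the circular buffer (base address
    taken as index 0), `start` is the initial pointer into it, and
    `step` is the offset applied after each access.
    """
    acc = 0      # the accumulator
    p = 0        # pointer into the coefficient vector (linear access)
    q = start    # pointer into the circular buffer (modulo access)
    for _ in range(len(coeffs)):
        acc += coeffs[p] * buffer[q]    # the MAC operation itself
        p += 1                          # post-modify: simple increment
        q = (q + step) % len(buffer)    # post-modify: wrap modulo the length
    return acc


result = mac_loop([1, 2, 3], [10, 20, 30, 40], start=2, step=1)
```

In this sketch the second pointer visits indices 2, 3 and then wraps to 0, so the two vectors follow different access patterns, as is typical.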
In general, it is the presence of an execution unit ("EU") especially designed to efficiently perform an atomic MAC operation that distinguishes a digital signal processor ("DSP") from a general purpose digital data processor. In view of the importance of timely supplying the MAC EU with operands, many DSPs incorporate a pair of special purpose data address generators ("DAGs") to assist the load/store unit ("LSU") in supplying operands to the MAC EU. In such DSPs, a single atomic "MAC instruction" may be provided to allow a programmer to specify both the details of the MAC operation and, via special purpose registers, the characteristics of each of the operand access patterns.
It has occurred to me that application of conventional microprocessor design concepts to DSPs should prove beneficial for numerous reasons. First, the majority of DSP algorithms involve loops. Second, DSP algorithms tend to be computationally intensive. Third, DSP application code is usually relatively small, with relatively few conditional branches, thus reducing the control logic required for branch prediction. Fourth, many modern DSPs have dedicated hardware for loop operations. Finally, the results of such operations are often only interim results which are consumed within the loop and never used again, thus reducing register pressure and traffic through the LSU.
For the purpose of making relative performance comparisons in the description that follows, I shall estimate circuit performance in terms of "units of delay", wherein I define One (1) unit of delay as the time required for an input signal to traverse a typical 3-input NAND gate and settle to the correct output logic level at the input of the next level of logic. Using a state of the art 0.18 micron manufacturing process, One (1) delay unit is approximately One Hundred (100) picoseconds. I will assume that such a typical gate would be implemented as a single, contiguous physical unit or cell of minimal sized transistors with minimum inter-transistor wiring. In all estimates that I shall make herein, I will also assume that, within each discrete functional unit, such as an adder, all requisite gates comprise a single, contiguous physical unit or super-cell so as to minimize inter-gate wiring.
In modern DSPs, the longest stage of the processing "pipeline" is the single-cycle MAC EU. Using current state of the art logic design, the critical speed path through a MAC EU is approximately Forty (40) delay units. Thus, the maximum clock rate for such a design would be on the order of Two Hundred Fifty (250) MHz. In contrast, the critical speed path through a current state of the art DAG is approximately Twenty (20) delay units. Since the DAG is already twice as fast as it needs to be to keep up with the MAC EU, there has been little incentive to improve its performance, particularly since such improvement would come only at the cost of additional hardware, power consumption, waste heat, etc.
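The relationship between delay units and clock rate used in these estimates can be checked with a short Python sketch. The helper name max_clock_mhz is hypothetical; the only inputs are the approximate figures given above, namely 100 picoseconds per delay unit, 40 delay units for the MAC EU and 20 delay units for the DAG.

```python
# From the text: one delay unit is approximately 100 picoseconds
# in a 0.18 micron process.
DELAY_UNIT_PS = 100


def max_clock_mhz(delay_units):
    """Approximate maximum clock rate, in MHz, for a pipe stage whose
    critical speed path is `delay_units` long (hypothetical helper)."""
    period_ps = delay_units * DELAY_UNIT_PS
    return 1_000_000 / period_ps  # a 1,000,000 ps period corresponds to 1 MHz


mac_mhz = max_clock_mhz(40)  # single-cycle MAC EU
dag_mhz = max_clock_mhz(20)  # prior art DAG, per the estimate above
```

Under these assumptions the DAG could run at twice the clock rate that the MAC EU permits, which is the margin referred to above.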
In the field of general purpose digital data processors, it has been demonstrated that considerable improvement in performance can be achieved by employing a very deep pipeline, on the order of Twelve (12) stages or more, and increasing the clock rate accordingly. In high performance processors, careful attention is given to partitioning the pipeline so as to balance the relative speed paths through each stage. A significant imbalance may indicate the desirability of splitting that stage into multiple stages or of augmenting that stage with additional hardware resources. In either case, the consequences on relative cost to performance must be considered.
In a modern deeply pipelined microprocessor, such as the "Alpha" (originally designed by engineers working for the Digital Equipment Corporation), the theoretical clock-cycle-limiting pipe stage is considered to consist of an input latch, a minimum arithmetic logic unit ("ALU") operation, and result forwarding back to the input latch, requiring about Eleven (11) delay units using current state of the art design techniques. Such a design allows single-cycle ALU forwarding, while achieving high clock frequency rates. It is also close to the minimum time required to drive and sample a state of the art memory array, such as a 64×64 static random access memory (SRAM) array. If such design techniques could be effectively applied to the MAC in a DSP, one might expect to realize commensurate improvement in system performance. However, just deeply-pipelining the MAC is not sufficient to achieve the desired 11-delay-unit clock cycle: the clock-cycle-limiting stage is now the DAG!
FIG. 1 illustrates a prior art data address generator (DAG 2) adapted for use in a DSP processor (not shown) having at least One (1), memory resident, data operand buffer (not shown), the location and size of which are specified by a base address ("B") and a length ("L"), stored in respective registers (not shown). The single-stage DAG 2 is constructed to generate, each clock cycle, an index pointer ("I") to the next operand in the buffer as a function of B, L, and an offset ("M"). In operation, the index pointer, I, steps through the buffer in increments of M. When I would step beyond the end of the buffer, i.e., where (I+M) is not less than (B+L), L is subtracted so that I wraps back, modulo L, to a valid address inside the buffer. Such a modulo address generation method can be described by the following algorithm, illustrated in the form of pseudocode:
for (a = 0; a < LoopCount; a++)
{
    if ((I[a] + M) < (B + L))
        I[a+1] = (I[a] + M);
    else
        I[a+1] = ((I[a] + M) - L);
}
where:
a is the loop counter;
LoopCount is the number of iterations of the loop;
B is the base address of the circular buffer;
M is the step size;
L is the length of the circular buffer;
I[a] is the current pointer; and
I[a+1] is the next pointer.
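The foregoing pseudocode can be rendered as a runnable Python sketch for purposes of illustration. The function names next_pointer and address_sequence are hypothetical, and the sketch assumes a positive step M no larger than L and an initial pointer that lies inside the buffer, so that a single subtraction of L is always sufficient.

```python
def next_pointer(i, b, l, m):
    """One step of the modulo address update.

    Assumes 0 < m <= l and b <= i < b + l, so one subtraction of l
    is always enough to bring the pointer back inside the buffer.
    """
    if i + m < b + l:
        return i + m        # next pointer is still inside the buffer
    return i + m - l        # wrap back into the buffer, modulo l


def address_sequence(i0, b, l, m, loop_count):
    """Collect the pointers visited over loop_count iterations."""
    pointers = [i0]
    for _ in range(loop_count):
        pointers.append(next_pointer(pointers[-1], b, l, m))
    return pointers


# Example: an 8-entry buffer at base address 100, stepped by 3.
sequence = address_sequence(100, 100, 8, 3, 7)
```

With these example values the pointer visits 100, 103, 106, then wraps to 101, and continues through 104, 107, 102 and 105, so every address in the buffer is reached before the pattern repeats.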
As shown in FIG. 1, the DAG 2 has three parallel computation paths: a sequential pointer path 4 which assumes that the next I will still be inside the buffer; a modulo correction pointer path 6 which assumes that the next I will be outside the buffer and thus must be modulo wrapped back into the buffer; and a pointer selection path 8 that decides which of the two assumptions is correct and controls a pointer select MUX 10 as appropriate. In normal operation, the initial and subsequent values for I are gated in via an input MUX 12, and the values for B, L and M are provided by respective registers (not shown). Note that the initial value for I need not be B, but may be any desired value so long as it lies between B and (B+L), inclusive. In a typical implementation, the sequential pointer path 4 is comprised of a carry-propagate-adder (CPA 14) which adds M to the last I, and provides a sequential I, i.e., (I+M), to the pointer select MUX 10. The modulo correction pointer path 6 is typically comprised of a carry-save-adder (CSA 16) and a carry-propagate-adder (CPA 18), which, together, add M to the last I, and, simultaneously, subtract L, and provide a modulo corrected I, i.e., (I+M−L), to the pointer select MUX 10. The pointer selection path 8 is comprised of a carry-save-adder (CSA 20) and a carry-propagate-adder (CPA 22), which, together, subtract the sequential pointer limit, i.e., (B+L) from the sequential I, i.e., (I+M), and provide the sign of the difference, i.e., (I+M)−(B+L), to the pointer select MUX 10. In operation, a negative sign indicates that the sequential I is correct, while a positive sign indicates that the modulo corrected I is correct. At an appropriate time, the output of the pointer select MUX 10 is forwarded to the register file (not shown), and simultaneously fed back to CPA 14, CSA 16, and CSA 20, via the MUX 12.
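The behavior of the three paths can be modeled, at a purely functional level, by the following Python sketch. The function name dag_step is hypothetical, the three assignments stand in for adders that operate concurrently in hardware, and a difference of exactly zero is assumed to be treated as non-negative, consistent with a sign bit taken from a two's complement result.

```python
def dag_step(i, b, l, m):
    """Behavioral model of one cycle of the prior art DAG.

    In hardware the three values below are computed in parallel; here
    they are simply computed one after another (hypothetical model).
    """
    sequential = i + m               # path 4: CPA 14
    corrected = i + m - l            # path 6: CSA 16 followed by CPA 18
    difference = (i + m) - (b + l)   # path 8: CSA 20 followed by CPA 22
    # Only the sign of the difference reaches the pointer select MUX 10.
    # Assumption: zero counts as non-negative, so the corrected pointer
    # is chosen when the sequential pointer lands exactly on (b + l).
    sign_is_negative = difference < 0
    return sequential if sign_is_negative else corrected


inside = dag_step(100, 100, 8, 3)    # sequential pointer selected
wrapped = dag_step(106, 100, 8, 3)   # modulo corrected pointer selected
```

Because both candidate pointers are computed speculatively, the selection path never lengthens either adder path; it only has to deliver its sign bit by the time the MUX needs it.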
Using state of the art design techniques, the single-cycle prior art DAG 2 of FIG. 1 has a critical speed path of about Seventeen (17) delay units: One (1) delay unit through the MUX 12; Two (2) delay units through the inter-stage latches (not shown) that would typically be provided on the inputs of CPA 14, CSA 16, and CSA 20; Four (4) delay units through each of the conventional CSAs; Eight (8) delay units through each of the conventional CPAs; One (1) delay unit through the pointer select MUX 10; and One (1) delay unit to account for the usual interconnect wiring. Note that the presence of the CSAs earlier in the logic flow path constrains the designer to use slower, static designs for at least CPA 18 and CPA 22. If, in a DSP having a DAG such as that shown in FIG. 1, the MAC operation could be somehow deeply pipelined so that the longest pipe stage has a critical speed path of less than Seventeen (17) delay units, the maximum clock rate for the DSP would then be limited by the speed of the DAG itself.
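The delay budget itemized above can be tallied with a short Python sketch. The dictionary keys are merely descriptive labels, and the 11-delay-unit figure is the target clock cycle discussed earlier in connection with deeply pipelined microprocessors.

```python
# Per-element delays, in delay units, along the longest path through the
# prior art DAG: one CSA followed by one CPA (paths 6 and 8 of FIG. 1).
CRITICAL_PATH = {
    "input MUX 12": 1,
    "inter-stage latches": 2,
    "carry-save-adder (CSA)": 4,
    "carry-propagate-adder (CPA)": 8,
    "pointer select MUX 10": 1,
    "interconnect wiring": 1,
}

TARGET_CYCLE = 11  # delay units, the deeply pipelined clock-cycle goal

total_delay_units = sum(CRITICAL_PATH.values())
excess_over_target = total_delay_units - TARGET_CYCLE
```

On these figures the prior art DAG exceeds the target cycle by six delay units, which is the gap an improved address generator would have to close.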
I have invented just such a deeply-pipelined DSP, as can be seen in my co-pending U.S. Application Ser. No. 09/536,656, entitled "Pipelined Processor Having Loosely Coupled Side Pipes", filed simultaneously herewith and incorporated herein by reference ("Co-pending Application"). If the full benefits inherent in partitioning the MAC so as to meet the 11-delay-unit-per-clock-cycle goal are to be realized, the speed of the DAG must be significantly improved. Therefore, a need exists for an improved method for modulo address generation, and for a modulo address generator which practices that method. To distinguish my improved design from prior art DAGs, I will hereafter refer to it as a "modulo address generator" or "MAG".