The timing performance of any system can be judged by one of two measures: latency or throughput. The delay from an input to the resulting output is called the latency, and most real-world applications require this delay to be minimized. If a system can have several computations in progress at once, then the minimum delay between two successive inputs determines the throughput, which is the maximum data rate at which the system can accept requests for computation. Performance assessed by either of these measures depends on the sum of the raw propagation delay through the combinational logic of the desired function plus "other" overhead delays. From a theoretical point of view, the fastest circuit would eliminate all overheads and have circuit delays due to only the raw combinational logic. The innovations in this patent reduce the latency overhead in a pipeline to zero. Hence, the ZOSTIL innovation will produce functions whose latency attains the theoretical lower bound, but without requiring the large and costly area of a full combinational array.
Traditional synchronous circuit design techniques separate combinational logic from data storage. That is, storage is provided by explicit latches interposed between sections of combinational logic. This design technique has at least five sources of overhead which increase circuit latency: 1) propagation delay through latches; 2) margin added to tolerate clock skew; 3) wasted time in fast stages within the system; 4) design for the maximum data-dependent delay; and 5) the assumption of worst-case timing of components.
The first source of latency overhead is due to latches because they introduce additional delays due to their set-up time and propagation delays. The minimum cycle time of a synchronous circuit is the sum of the latch set-up time, latch propagation delay, and maximum combinational logic delay. The first innovation in the ZOSTIL methodology is to remove this overhead completely by removing the explicit latches altogether and making use of the "free" half-latch at the output of each stage in a CMOS domino chain.
The second source of latency overhead comes from needing to distribute the clock to all latches in the system. Communicating stages must be in agreement as to when the clock edges occur, but wire or driver delays cause clock skew which must be compensated for by adding some margin to the total clock period. This added margin is also overhead. Previous asynchronous design techniques used handshaking blocks to remove global clocks and the extra latency overhead due to clock skew by communicating data validity locally instead of globally. But these previous techniques included explicit latches, and hence still had the latency overhead due to latch propagation delays. Previous techniques also added some overhead due to the forward directed paths within the handshaking logic. The second ZOSTIL innovation is to ensure all control paths operate in parallel with the forward evaluation rather than adding sequentially to the path.
The third source of latency overhead is due to mismatching of the functional sections between the latches. Because the amount of time in a clock period is fixed, it must be set equal to the longest propagation delay of all of the different functional sections in the system. The difference between that maximum and the actual time used by any functional section is overhead because it is wasted time. A self-timed dataflow does not waste this time because it allows data to flow forward based on data-driven local control, rather than waiting for clock edges. Although the throughput of a pipeline is still limited by its slowest stage, the latency is improved by letting each stage progress as soon as it can.
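The difference between the two latency models can be illustrated with a short calculation. The following is a minimal sketch, not part of the disclosed circuits: the function names, the stage delays, and the latch and skew figures are all hypothetical values chosen for illustration, and the self-timed model is idealized (no control overhead on the forward path).

```python
def synchronous_latency(stage_delays, latch_overhead=0.0, skew_margin=0.0):
    """Latency of a clocked pipeline: every stage costs one full clock period.

    The period is set by the slowest stage plus latch and skew overhead.
    """
    clock_period = max(stage_delays) + latch_overhead + skew_margin
    return len(stage_delays) * clock_period


def self_timed_latency(stage_delays):
    """Latency of an idealized self-timed pipeline.

    Each stage starts as soon as its input data arrives, so the delays add.
    """
    return sum(stage_delays)


# Hypothetical logic delays of four pipeline stages, in nanoseconds.
stage_delays = [3.0, 5.0, 2.0, 4.0]

sync_ns = synchronous_latency(stage_delays, latch_overhead=1.0, skew_margin=0.5)
self_timed_ns = self_timed_latency(stage_delays)
wasted_ns = sync_ns - self_timed_ns

print(sync_ns, self_timed_ns, wasted_ns)
```

With these assumed numbers the clocked pipeline spends 6.5 ns per stage regardless of how quickly a stage actually finishes, whereas the self-timed pipeline spends only the sum of the actual stage delays.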
The fourth source of latency overhead comes from determining critical paths in synchronous logic based on the worst-case data values. If there is a large variance in delay, then there is a large performance loss due to the difference between the average and maximum values of delay. Synchronous designers try to adjust transistor sizing to equalize the various paths through a body of logic, but in self-timed systems it is desired to minimize the probabilistic expected value of the delay rather than minimizing the maximum delay. The third innovation of this patent is to make use of any known probabilistic distribution of the inputs of each block of logic in order to size the transistors in that block to minimize the expected value of the total delay.
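The trade-off between worst-case and expected delay can be made concrete with a small calculation. This is an illustrative sketch only: the two sizing options, their path probabilities, and their delays are invented numbers, and the helper names are hypothetical.

```python
def expected_delay(paths):
    """Expected delay for a list of (probability, delay) pairs."""
    total_probability = sum(p for p, _ in paths)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("path probabilities must sum to 1")
    return sum(p * d for p, d in paths)


def worst_case_delay(paths):
    """Maximum delay over all paths, the figure a clocked design must use."""
    return max(d for _, d in paths)


# Hypothetical sizing A: all paths equalized, as a synchronous designer would.
equalized = [(0.9, 4.0), (0.1, 4.0)]
# Hypothetical sizing B: the common path is sped up at the rare path's expense.
skewed = [(0.9, 3.0), (0.1, 6.0)]

for name, paths in (("equalized", equalized), ("skewed", skewed)):
    print(name, worst_case_delay(paths), round(expected_delay(paths), 3))
```

Under these assumptions the skewed sizing has the worse maximum delay (6.0 versus 4.0) but the better expected delay (about 3.3 versus 4.0), which is the quantity a self-timed block actually experiences on average.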
The fifth source of latency overhead is the derating used to ensure performance over a range of temperature and voltage levels. Synchronous system design must always be based on conservative derated "worst-case" specifications because the system must work at the environmental extremes. But when the actual conditions are not at the extremes, the difference between the possible performance and the actual designed performance is wasted performance. Self-timed components will always run at their maximum speed for the existing conditions and deliver their outputs as soon as they are actually finished. By providing completion indication, they allow an enclosing system to make use of the output sooner than always waiting for the worst case.
Background and Nomenclature for Dual-Monotonic Signals; Background on Domino Logic; Overview of the Innovations
If A is a dual-monotonic signal, it is represented by two "sub-signals", called A.sup.0 and A.sup.1, with the encoding: if both of the wires are in the same logical state, say low, then the signal A has not yet evaluated; if either A.sup.0 or A.sup.1 changes state, this communicates that the signal A has finished evaluating, and the state of A is determined by noting which of the two wires changed. For example, if both A.sup.0 and A.sup.1 have the binary value `0`, then the value of the signal, A, is not yet determined. If A.sup.1 transitions to `1`, then the value of A is `1`, while if A.sup.0 transitions to `1`, then the value of A is `0`. The pair of wires is called a dual-monotonic pair because the transitions on the wires must be monotonic during evaluation. These transitions are mutually exclusive, and either one indicates that the evaluation of A is complete and that A can be used by other circuits. In this patent, signal names are italicized, and a "*" is used to indicate logical inversion. Also, each half of a dual-monotonic signal will have a superscript of 1 or 0.
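The encoding described above can be sketched in a few lines. This is a behavioral illustration only, assuming the active-high convention used in the example (both wires low means "not yet evaluated"); the function names are hypothetical.

```python
from typing import Optional, Tuple


def decode_dual_monotonic(wire0: int, wire1: int) -> Optional[int]:
    """Return the logical value carried by a dual-monotonic pair.

    wire0 and wire1 stand for A.sup.0 and A.sup.1. The result is None while
    the signal has not yet evaluated (both wires still low).
    """
    if wire0 and wire1:
        # The two transitions are mutually exclusive, so this never
        # occurs in a correctly operating circuit.
        raise ValueError("illegal state: both wires asserted")
    if wire1:
        return 1
    if wire0:
        return 0
    return None


def encode_dual_monotonic(value: Optional[int]) -> Tuple[int, int]:
    """Return (wire0, wire1) for a logical value, or (0, 0) for 'not ready'."""
    if value is None:
        return (0, 0)
    return (0, 1) if value else (1, 0)


def is_complete(wire0: int, wire1: int) -> bool:
    """Completion indication: either wire having risen means A is valid."""
    return bool(wire0 or wire1)
```

Because validity is carried by the data wires themselves, a receiver needs no separate "data ready" signal: `is_complete` is simply the OR of the two wires.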
Monotonic signals can be conveniently generated by CMOS domino logic. Each signal can be in one of three functional phases: 1) precharge or reset, 2) logic evaluation, or 3) data storage. These three phases are shown in FIG. 1 which shows, respectively, a two-input dual-monotonic AND gate and its waveform diagram. During the reset phase, the active low precharge signal, P*, is active and the A and B signals must be inactive. This causes the precharged nodes X* and Y* to be high, and the Q outputs to be low. In the logic evaluation phase, either A.sup.0 or A.sup.1 and either B.sup.0 or B.sup.1 will transition high monotonically. If both A.sup.1 and B.sup.1 transition high, the AND gate's Q.sup.1 output monotonically transitions high, and if either A.sup.0 or B.sup.0 goes active, the Q.sup.0 output will go high. During the data storage phase, both A and B signals are forced low, and P* remains inactive. This condition leaves the precharged nodes X* and Y* undriven, and capacitance causes them to act as memory elements so the outputs, Q.sup.1 and Q.sup.0, remain in the same state as they were during the logic evaluation phase. Thus, each domino stage includes a "free" half-latch because no additional transistors and no additional logic delays are needed to store data.
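The three phases of the AND gate can be mimicked with a small behavioral model. This is a logic-level sketch, not a transistor-level description of FIG. 1: the class name is hypothetical, charge leakage is ignored, and the assignment of node X* to output Q.sup.1 and node Y* to output Q.sup.0 is an assumption made for illustration.

```python
class DualRailDominoAnd:
    """Behavioral model of a two-input dual-monotonic domino AND gate."""

    def __init__(self):
        # Precharged internal nodes (active low). Assumed mapping:
        # X* feeds the inverter driving Q1, Y* feeds the inverter driving Q0.
        self.x_n = 1
        self.y_n = 1

    def step(self, p_n, a0, a1, b0, b1):
        """Apply one set of input levels and return the outputs (q0, q1)."""
        inputs_active = a0 or a1 or b0 or b1
        if p_n == 0:
            # Precharge (reset) phase: P* is active low, inputs must be idle.
            if inputs_active:
                raise ValueError("inputs must be inactive during precharge")
            self.x_n = 1
            self.y_n = 1
        else:
            # Evaluation phase: a pull-down path can only discharge a node.
            if a1 and b1:
                self.x_n = 0
            if a0 or b0:
                self.y_n = 0
            # Storage phase: with all inputs low neither node is driven,
            # so both keep their previous value (the "free" half-latch).
        return (1 - self.y_n, 1 - self.x_n)


gate = DualRailDominoAnd()
after_reset = gate.step(0, 0, 0, 0, 0)     # precharge
after_eval = gate.step(1, 0, 1, 0, 1)      # A = 1, B = 1
after_storage = gate.step(1, 0, 0, 0, 0)   # inputs withdrawn, P* inactive
print(after_reset, after_eval, after_storage)
```

The model shows the point made in the text: once the inputs are withdrawn, the outputs hold the evaluated result without any additional storage element.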
CMOS domino logic is normally used only in two phases: precharge and logic evaluation. The invention of the present patent uses a third phase to store data, which allows domino logic gates to be cascaded and pipelined without intervening latches. The inputs to this system must have strictly monotonic transitions during the logic evaluation phase and the precharge signal must be active during only the precharge phase. Furthermore, the pipelined system can feed its output back to the input to form an iterative structure. Such a feedback pipeline is viewed as a "loop" or "ring" of logic which circulates data until the entire computation is complete.
The innovation of making use of the temporary storage of a precharged function block allows the explicit latches to be omitted. Each domino stage provides the operation of a half-latch for free. The Reset Control logic operates completely in parallel with the function block evaluation. Completion detection logic in each Reset Control block observes the output of the following Function Block to determine when all of its outputs have finished evaluating and then instructs its own Function Block to move from the data storage phase to the precharge phase, driving all its outputs to the reset state. When the outputs of the following Function Block subsequently become reset, the Reset Control turns off the precharge signal for its Function Block, causing it to be ready for the data evaluation phase when its next data input actually arrives.
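The behavior of one Reset Control block can be summarized as a small decision function. This is a sketch under stated assumptions, not the disclosed circuit: the function name is hypothetical, the precharge signal is modeled as a boolean "active" flag rather than the active-low P*, and holding the previous value while the following stage is only partly evaluated or partly reset is an assumption, analogous to the hysteresis of a Muller C-element.

```python
def reset_control(precharge_active, next_stage_outputs):
    """Decide whether this stage's precharge should be active.

    precharge_active   -- current state of this stage's precharge (True = on)
    next_stage_outputs -- list of (wire0, wire1) dual-monotonic pairs produced
                          by the FOLLOWING Function Block
    """
    all_evaluated = all(w0 or w1 for w0, w1 in next_stage_outputs)
    all_reset = all(not (w0 or w1) for w0, w1 in next_stage_outputs)

    if all_evaluated:
        # The following stage has consumed our data: start precharging.
        return True
    if all_reset:
        # The following stage has been reset: release precharge so this
        # stage is ready to evaluate when its next input arrives.
        return False
    # Mixed state (assumption): hold the previous decision.
    return precharge_active


# The following stage finishes evaluating, then is reset again.
state = False
trace = []
for outputs in ([(0, 0), (0, 0)],   # following stage idle
                [(0, 1), (0, 0)],   # partly evaluated
                [(0, 1), (1, 0)],   # fully evaluated
                [(0, 0), (1, 0)],   # partly reset
                [(0, 0), (0, 0)]):  # fully reset
    state = reset_control(state, outputs)
    trace.append(state)
print(trace)
```

Note that the function reads only the outputs of the following stage and never gates this stage's own data inputs, which reflects the statement that the control operates in parallel with forward evaluation.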
By encoding the data in dual-monotonic pairs, there is no forward handshake required and thus the control logic is removed from the critical path of the circuit. This innovative methodology, in conjunction with the first innovation removing the need for explicit latches, yields a truly zero overhead minimum latency delay path through pipelined logic.
The ZOSTIL technique includes combining the latch-free circuits and parallel Reset Control into an iterative structure, or "ring." This is particularly important for arithmetic operations which perform the same basic function over and over. Examples of these types of functions are: multiplication, division, square root, sine, and cosine.
ZOSTIL circuits are robust because, with proper design of the control logic, they are delay-independent. That is, the circuits will function correctly regardless of the actual delays of the circuit elements. Therefore, calculations involving delays are not necessary to ensure the logical correctness or functionality of the system, but are used only to estimate the performance. This contrasts with synchronous design techniques, which require extensive delay calculations to ensure all computations within a single logic stage can be performed in one clock cycle. Improper delay estimation may result in a synchronous circuit which does not always produce the correct result.
Division algorithms generate a quotient by successive determination of quotient digits from most significant to least significant. Because each quotient digit is used in the computation of the next partial remainder, which in turn is required to determine the next quotient digit, division is an inherently sequential process. Hence, a pipelined ring designed with the ZOSTIL technique is ideal for performing arithmetic division. An additional innovation specific to division is to overlap and interlock stages to allow two remainder computations to occur in parallel. This is accomplished by modifying an algorithm, known as SRT division, to perform several small remainder computations in parallel and choose the correct remainder when the quotient digit from the previous stage is determined. This innovation improves the overall latency by a factor of two in comparison with the previous algorithms.
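The idea of computing candidate remainders before the quotient digit is known can be illustrated with a software model of radix-2 SRT division. This is a simplified sketch, not the disclosed overlapped hardware: the function names are hypothetical, the digit set {-1, 0, 1} and the selection thresholds of one half are the textbook radix-2 choice, the divisor is assumed normalized to the range [1/2, 1), and the digit is selected from the exact shifted remainder rather than from a truncated estimate.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def select_digit(shifted_remainder):
    """Textbook radix-2 SRT quotient digit selection."""
    if shifted_remainder >= HALF:
        return 1
    if shifted_remainder < -HALF:
        return -1
    return 0


def srt_divide(dividend, divisor, steps):
    """Return (quotient, remainder) after `steps` radix-2 SRT iterations.

    Invariant: dividend == quotient * divisor + remainder / 2**steps.
    """
    if not (HALF <= divisor < 1):
        raise ValueError("divisor must be normalized to [1/2, 1)")
    if not (0 <= dividend < divisor):
        raise ValueError("dividend must satisfy 0 <= dividend < divisor")

    remainder = Fraction(dividend)
    quotient = Fraction(0)
    for j in range(1, steps + 1):
        shifted = 2 * remainder
        # Speculative step: form the next remainder for every possible
        # digit. In hardware these would be computed side by side while
        # the digit selection is still in progress.
        candidates = {q: shifted - q * divisor for q in (-1, 0, 1)}
        digit = select_digit(shifted)
        # Selection step: keep only the candidate matching the digit.
        remainder = candidates[digit]
        quotient += Fraction(digit, 2 ** j)
    return quotient, remainder


n, d = Fraction(3, 8), Fraction(5, 8)
q, r = srt_divide(n, d, steps=8)
print(q, r, q * d + r / 2 ** 8 == n)
```

Exact rational arithmetic is used so that the invariant can be checked without rounding error. The redundant digit set is what makes the selection tolerant of an imprecise remainder estimate in real SRT hardware; that tolerance is not modeled here.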