The timing performance of any system can be judged by one of two measures: latency or throughput. The delay from an input to the resulting output is called the latency, and most real-world applications want this delay minimized. If a system can have several computations in progress at once, then the minimum delay between two successive inputs determines the throughput, which is the maximum data rate at which the system can accept requests for computation. Performance assessed by either of these measures depends on the sum of the raw propagation delay through the combinational logic of the desired function plus "other" overhead delays. From a theoretical point of view, the fastest circuit would eliminate all overheads and have circuit delays due only to the raw combinational logic. The innovations in this patent reduce the latency overhead in a pipeline to zero. Hence, the ZOSTIL innovation will produce functions whose latency attains the theoretical lower bound, but without requiring the large and costly area of a full combinational array.
Traditional synchronous circuit design techniques separate combinational logic from data storage. That is, storage is provided by explicit latches interposed between sections of combinational logic. This design technique has at least five sources of overhead which increase circuit latency: 1) propagation delay through latches; 2) margin added to tolerate clock skew; 3) wasted time in fast stages within the system; 4) designing for the maximum data-dependent delay; and 5) the assumption of worst-case timing of components.
The first source of latency overhead is the latches themselves, which add delay through their set-up time and propagation delay. The minimum cycle time of a synchronous circuit is the sum of the latch set-up time, latch propagation delay, and maximum combinational logic delay. The first innovation in the ZOSTIL methodology is to remove this overhead completely by removing the explicit latches altogether and making use of the "free" half-latch at the output of each stage in a CMOS domino chain.
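The cycle-time relation stated above can be sketched in a few lines of Python. The function name and all delay values (in picoseconds) are hypothetical and chosen only for illustration; none of them come from this patent.

```python
def min_cycle_time(setup, latch_prop, logic_delays):
    """Minimum synchronous cycle time: latch set-up time + latch
    propagation delay + the longest combinational logic delay."""
    return setup + latch_prop + max(logic_delays)


# Hypothetical delays in picoseconds (illustrative only).
setup_ps = 50
latch_prop_ps = 100
logic_delays_ps = [400, 650, 500]

cycle_ps = min_cycle_time(setup_ps, latch_prop_ps, logic_delays_ps)
latch_overhead_ps = cycle_ps - max(logic_delays_ps)

print("minimum cycle time (ps):", cycle_ps)
print("latch overhead per cycle (ps):", latch_overhead_ps)
```

Under these assumed numbers the latch terms are the part of each cycle that does no logic work, which is the overhead that removing the explicit latches is meant to eliminate.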
The second source of latency overhead comes from needing to distribute the clock to all latches in the system. Communicating stages must be in agreement as to when the clock edges occur, but wire or driver delays cause clock skew, which must be compensated for by adding some margin to the total clock period. This added margin is also overhead. Previous asynchronous design techniques used handshaking blocks to remove global clocks, and with them the extra latency overhead due to clock skew, by communicating data validity locally instead of globally. But these previous techniques included explicit latches, and hence still had the latency overhead due to latch propagation delays. Previous techniques also added some overhead due to the forward-directed paths within the handshaking logic. The second ZOSTIL innovation is to ensure that all control paths operate in parallel with the forward evaluation rather than adding sequentially to the path.
The third source of latency overhead is due to mismatching of the functional sections between the latches. Because the amount of time in a clock period is fixed, it must be set equal to the longest propagation delay of all of the different functional sections in the system. The difference between that maximum and the actual time used by any functional section is overhead because it is wasted time. A self-timed dataflow does not waste this time because it allows data to flow forward based on data-driven local control, rather than waiting for clock edges. Although the throughput of a pipeline is still limited by its slowest stage, the latency is improved by letting each stage progress as soon as it can.
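As a rough illustration of the stage-mismatch argument, the sketch below compares the latency of a clocked pipeline with that of an idealized self-timed one. It assumes no latch, skew, or control overhead in either case, and the stage delays (in picoseconds) and function names are hypothetical.

```python
def clocked_latency(stage_delays):
    """Every stage is given one full clock period, and the period
    must equal the slowest stage's delay."""
    return max(stage_delays) * len(stage_delays)


def self_timed_latency(stage_delays):
    """Each stage passes its result on as soon as it finishes, so
    the latency is just the sum of the actual stage delays."""
    return sum(stage_delays)


def min_input_interval(stage_delays):
    """In both styles the slowest stage limits how often a new
    input can be accepted (the inverse of throughput)."""
    return max(stage_delays)


# Hypothetical stage delays in picoseconds (illustrative only).
stage_delays_ps = [400, 650, 500]

wasted_ps = clocked_latency(stage_delays_ps) - self_timed_latency(stage_delays_ps)
print("clocked latency (ps):", clocked_latency(stage_delays_ps))
print("self-timed latency (ps):", self_timed_latency(stage_delays_ps))
print("time wasted in fast stages (ps):", wasted_ps)
```

In this simplified model the minimum input interval is the same for both styles, while the latency differs by exactly the time the faster stages would otherwise spend waiting for the clock edge.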
The fourth source of latency overhead comes from determining critical paths in synchronous logic based on the worst-case data values. If there is a large variance then there is a large performance loss due to the difference between the average and maximum values of delay. Synchronous designers try to adjust transistor sizing to equalize the various paths through a body of logic, but in self-timed systems it is desired to minimize the probabilistic expected value of the delay rather than minimizing the maximum delay. The third innovation of this patent is to make use of any known probabilistic distribution of the inputs of each block of logic in order to size the transistors in that block to minimize the expected value of the total delay.
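The difference between sizing for the worst case and sizing for the expected case can be illustrated with a toy model. The sketch below is not the sizing procedure of this patent: it assumes two alternative paths whose delay is inversely proportional to the transistor width assigned to them, a fixed total width budget, and known input probabilities. All names and numbers are hypothetical.

```python
from fractions import Fraction


def path_delays(work, widths):
    """Toy model (an assumption): delay of a path = work / width."""
    return [Fraction(w, x) for w, x in zip(work, widths)]


def worst_delay(work, widths):
    """What a synchronous designer minimizes: the slowest path."""
    return max(path_delays(work, widths))


def expected_delay(work, weights, widths):
    """Probability-weighted average delay over the paths."""
    total = sum(weights)
    return sum(Fraction(p, total) * d
               for p, d in zip(weights, path_delays(work, widths)))


# Hypothetical block: two equal paths, the first taken 9 times in 10.
work = (120, 120)      # arbitrary delay units at unit width
weights = (9, 1)       # relative input probabilities
budget = 12            # total width units to share between the paths

# Every way of splitting the budget into two positive integer widths.
candidates = [(a, budget - a) for a in range(1, budget)]

min_max = min(candidates, key=lambda w: worst_delay(work, w))
min_exp = min(candidates, key=lambda w: expected_delay(work, weights, w))

print("worst-case sizing:", min_max,
      "expected delay =", expected_delay(work, weights, min_max))
print("expected-case sizing:", min_exp,
      "expected delay =", expected_delay(work, weights, min_exp))
```

In this model, equalizing the paths minimizes the maximum delay, while shifting width toward the more probable path lowers the expected delay at the cost of a slower rare case. That trade is only acceptable when, as in a self-timed block, completion is signalled rather than assumed.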
The fifth source of latency overhead is the derating used to ensure performance over a range of temperature and voltage levels. Synchronous system design must always be based on conservative derated "worst-case" specifications because the system must work at the environmental extremes. But when the actual conditions are not at the extremes, the difference between the possible performance and the actual designed performance is wasted performance. Self-timed components will always run at their maximum speed for the existing conditions and deliver their outputs as soon as they are actually finished. By providing completion indication, they allow an enclosing system to make use of the output sooner than if it always waited for the worst case.