Conventional Complementary Field-Effect Transistor ("CFET") logic circuits include N-channel field-effect transistors ("NFET") and P-channel field-effect transistors ("PFET"). In the following description the terms CFET, NFET, and PFET should be interpreted to include all field-effect transistor integrated circuit technologies. Metal-Oxide Semiconductor ("MOS") processes are often used to fabricate Field-Effect Transistor ("FET") logic circuits. As used in this description, the terms MOS and FET are interchangeable.
Conventional logic-circuit-design techniques contemplate increasing the throughput of a system with a "pipeline". The pipeline comprises a number of logic sections, each separated by a register section. Each system clock transition allows a "data signal" (herein also simply called "signal") to propagate from one register section, through the following logic section, and to the inputs of the following register section. Typically, new signal inputs are not fed into a logic section until the previous signal outputs are latched into the register section following that logic section. The maximum clock frequency for a logic section (i.e., the frequency with which new data can be switched into a logic section) is limited by the maximum propagation delay of a path through that logic section.
One way of increasing system throughput is to break up logic sections into smaller sections (each with a shorter propagation delay) and insert pipeline register-section levels to separate the smaller logic sections. The clock speed can then be increased to take advantage of the shorter logic-section delays.
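As a rough illustration of this trade-off, the following Python sketch models the clock period of a registered pipeline as the delay of the slowest logic section plus a fixed per-register overhead. The function name, the delay values, and the overhead term are assumptions made for illustration; they are not figures from the cited work.

```python
def max_clock_frequency(section_delays_ns, register_overhead_ns=0.0):
    """Return the highest clock frequency (in GHz) a registered pipeline can sustain.

    Assumed model: the clock period must cover the slowest logic section
    plus the timing overhead of one register stage.
    """
    if not section_delays_ns:
        raise ValueError("at least one logic section is required")
    period_ns = max(section_delays_ns) + register_overhead_ns
    return 1.0 / period_ns  # 1 / ns == GHz


# One 8 ns logic section versus the same logic split into four 2 ns sections
# (hypothetical numbers, with a 0.5 ns register overhead per stage).
unsplit = max_clock_frequency([8.0], register_overhead_ns=0.5)
split = max_clock_frequency([2.0, 2.0, 2.0, 2.0], register_overhead_ns=0.5)
print(f"unsplit: {unsplit:.3f} GHz, split: {split:.3f} GHz")
```

In this simplified model the register overhead does not shrink as the logic sections are subdivided, so the gain from adding register levels diminishes as the sections become shorter.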
This "pipelining" technique has been used to obtain significant speed-up of computer systems. FIG. 1a illustrates conventional pipelining, showing the edges of signals propagating through small combinational-logic blocks. Conventionally, a combinational-logic function unit is partitioned into several smaller combinational-logic blocks, and register stages are inserted between adjacent combinational-logic blocks as synchronizers. However, the inserted register stages increase physical area and add clock-distribution requirements, which limits performance.
The increasing demand for high-speed, compact devices and systems, and the limitations of existing design methods, have prompted researchers to look for alternate techniques that can lead to high-performance digital systems. One such method is called "wave pipelining". Wave pipelining eliminates intermediate register stages in a pipeline system by using the internal capacitance of a combinational block for storage. Wave-pipelined systems do, however, have strict requirements on (a) the uniformity of path delays, (b) uniformity of output-signal rise and fall times, and (c) the independence of delay from the pattern of input signal transitions.
FIG. 1b shows one embodiment of a wave-pipelining technique. In FIG. 1b, the internal capacitances in the combinational logic act in effect as temporary storage elements. These dynamic storage elements take the place of static registers used in the conventional pipelining method shown in FIG. 1a. Under the approach shown in FIG. 1b, new data values are latched in before the previous data values propagate to the next set of registers. In this way, there are multiple coherent data "waves" within the combinational-logic block. Hence, the system clock period can be much shorter than the propagation delay of the combinational-logic block between adjacent system-clocked-register stages.
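The relationship between block delay, clock period, and the number of data waves can be sketched in a few lines of Python. This is a simplified count that ignores clock skew and register timing; the function name and the example numbers are assumptions made for illustration.

```python
import math


def waves_in_flight(block_delay_ns, clock_period_ns):
    """Approximate number of data waves simultaneously inside one logic block.

    Assumed model: a new wave is launched every clock period, and each wave
    needs block_delay_ns to travel from the input registers to the output
    registers.
    """
    if block_delay_ns <= 0 or clock_period_ns <= 0:
        raise ValueError("delays must be positive")
    return math.ceil(block_delay_ns / clock_period_ns)


# Conventional clocking: the period covers the whole block, so one wave at a time.
print(waves_in_flight(10.0, 10.0))
# Wave pipelining: a 2.5 ns period on the same 10 ns block keeps several waves in flight.
print(waves_in_flight(10.0, 2.5))
```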
The concept of wave pipelining (also called "maximum-rate pipelining") was first described by Cotten [Cotten:69] and Anderson, et al. [Anderson:67], and was applied in the design of the IBM 360/91 floating-point execution unit in the 1960s. The significant advantages of wave pipelining are:
(1) Achieving very high pipeline rates that approach the physical speed limit of the technology;
(2) Increasing pipeline rate without significant latency increase;
(3) Minimizing clock loading and reducing clock-distribution problems; and
(4) Using fewer registers and reducing the area overhead otherwise required by conventional pipelining.
To obtain a high operating speed, each path through a given functional block must have similar path delays. This requires symmetric rise and fall times (collectively called "transition" times) of output signals, and, for each component within the logic function block, delays that are independent of the input-signal transition patterns. Wave-pipelined systems are also susceptible to process and environmental variations, which cause propagation-delay-variation problems [Klass:93b].
Recently, with the demanding digital system speed and throughput requirements of various applications, wave pipelining has received considerable attention from many research groups [Wong:93][Fan:92][Klass:92][Zhang:93]. In addition, Ekroot [Ekroot:87] developed a theory of wave pipelining and a linear program to insert delay elements to balance the circuit under the assumption of fixed gate and module delays.
Wong et al. [Wong:93][Wong:91] continued their initial research and developed algorithms to automatically equalize delays in bipolar combinational logic circuits to achieve a high degree of wave pipelining. These authors have also reported the results of a 63-bit population counter using CML (Current-Mode Logic) bipolar technology, and discussed the limitations of using standard CMOS technology for wave pipelining.
Fan et al. [Fan:92], and Klass and Mulder [Klass:92] studied the use and limitations of CMOS technology for wave pipelining. They designed wave-pipelined CLA (Carry Look-Ahead) adders and showed performance improvement over conventional methods.
Lam et al. [Lam:92] analyzed valid clocking in wave-pipelined circuits using Timed Boolean Functions.
Joy and Ciesielski [Joy:91] have proposed certain physical placement of components and specific routing algorithms for laying out wave-pipelined circuits. Klass, Flynn and Goor reported the design of a fast CMOS wave-pipelined multiplier [Klass:93b][Klass:93a].
The timing constraints of wave-pipelined circuits have been carefully studied and discussed by several research groups. In summary, for a wave-pipelined system using edge-triggered registers, the minimum clock-period relation should be [Cotten:69][Klass:92][Wong:91]:
$$t_{cp} > \max\{(\Delta t_p + 2\Delta C + t_s + t_h + t_{rf}),\ (\Delta t_x + \Delta C + t_{ms} + t_{rf})\} \qquad \text{(Equation 1)}$$
where the variables are defined as
$t_{cp}$ is the valid clock period,
$\Delta t_p$ is the maximum time difference between the longest and shortest paths for the worst-case design,
$\Delta C$ is the worst-case clock skew,
$t_s$ is the setup time for registers,
$t_h$ is the hold time for registers,
$t_{rf}$ is the worst-case rise/fall time at the last logic stage,
$\Delta t_x$ is the maximum time difference between the longest and shortest path from the global inputs to an internal signal node X, and
$t_{ms}$ is the minimum stable time for X to ensure the correct operation of the next logic stage.
Both transition times and signal-propagation delays must be constrained to avoid data wave interference. The clock period time limit to prevent interference of a data wave with any previous data wave at the ending storage element of a wave-pipelined logic section is bounded by $t_{cp} > (\Delta t_p + 2\Delta C + t_s + t_h + t_{rf})$. The clock period time limit to prevent interference of a data wave with any previous data wave inside a section of combinational logic is bounded by $t_{cp} > (\Delta t_x + \Delta C + t_{ms} + t_{rf})$.
To achieve the maximum wave-pipeline rate, designers should minimize $t_{cp}$ in Equation 1. Here, it is assumed that the clock skew $\Delta C$ can be minimized by conventional design techniques, and that the terms $t_s$, $t_h$, $t_{rf}$, and $t_{ms}$ are technology-dependent parameters specific to a certain logic stage, so they can be optimized individually. The remaining terms, $\Delta t_p$ and $\Delta t_x$, arise from the following possible sources:
(1) path differences due to practical circuit configurations,
(2) data-dependent signal-delay variations, and
(3) process- and temperature-induced variations.
As some process- and temperature-induced variations are unavoidable, the focus should be on the path differences that are due to practical circuit configurations and data-dependent delay variations. Therefore, if possible, a wave-pipelined circuit should be designed to have balanced paths (in terms of the basic logic gates and delay elements) in order to keep $\Delta t_p$ and $\Delta t_x$ as close to zero as possible.
Unfortunately, most practical digital circuits do not have such balanced configurations. Therefore, specific algorithms have been suggested for designing practical wave-pipelined circuits by inserting delay elements ("rough tuning") and adjusting gate-driving abilities ("fine tuning") [Wong:93][Wong:89].
Even for a balanced circuit, the data-dependent delay variations of logic gates can still contribute to the values of $\Delta t_p$ and $\Delta t_x$. Thus, from the viewpoint of circuit designers, the minimum clock period is ultimately bounded by the delay variations of the basic logic circuit used in a wave-pipelined system. Therefore, the choice of the circuit family for the wave-pipelined system design can have a significant impact on performance through the effect of delay variations at the gate level. A set of ideal properties of the basic circuits for wave pipelining can be summarized as follows:
(1) same gate delay for both rising and falling edges of output signal,
(2) no variation in the gate delay due to different input patterns, and
(3) no variation in the gate delay due to different previous input patterns.
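To see how path-delay variation limits the clock rate, the two bounds of Equation 1 can be evaluated numerically. The following Python sketch does so; the function name, argument names, and the nanosecond values are assumptions chosen for illustration. Because Equation 1 is a strict inequality, the returned value is a lower bound that the actual clock period must exceed.

```python
def min_clock_period(dt_p, dt_x, d_c, t_s, t_h, t_rf, t_ms):
    """Lower bound on the clock period t_cp from Equation 1 (same time unit throughout).

    dt_p: longest-minus-shortest path delay to the output registers
    dt_x: longest-minus-shortest path delay to an internal node X
    d_c:  worst-case clock skew
    t_s, t_h: register setup and hold times
    t_rf: worst-case rise/fall time at the last logic stage
    t_ms: minimum stable time required at node X
    """
    register_bound = dt_p + 2 * d_c + t_s + t_h + t_rf  # interference at the registers
    internal_bound = dt_x + d_c + t_ms + t_rf           # interference inside the logic
    return max(register_bound, internal_bound)


# Hypothetical values in nanoseconds.
bound = min_clock_period(dt_p=0.5, dt_x=0.3, d_c=0.1, t_s=0.2, t_h=0.1, t_rf=0.3, t_ms=0.4)
print(f"t_cp must exceed {bound:.2f} ns")
```

With the other terms held fixed, any increase in the path-delay spread raises the bound by the same amount, which is why gate-level delay variation matters so much to the achievable pipeline rate.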
By examining these requirements, it was found that bipolar circuit families (Emitter-Coupled Logic ("ECL"), super-buffered ECL, and Current-Mode Logic ("CML")) are good candidates for wave pipelining [Wong:93]. Standard CMOS was not well suited for this technique, since CMOS gate delay depends strongly on the input patterns or different signal-timing patterns [Klass:92][Fan:92]. For example, the standard prior-art two-input CMOS NAND gate 10 shown in FIG. 1c has two transistors in parallel (21 and 22) and two transistors in series (23 and 24). The physical characteristics of transistors 23 and 24 can be designed so that together they pull output 31 down to a logic "zero" at a rate corresponding to the rate at which transistors 21 and 22 together can pull output 31 up to a logic "one". In such an embodiment, if input signals 11 and 12 both start at "one", and both switch to "zero", transistor 21 and transistor 22 will both switch, driving output 31 from ground potential 14 to $V_{DD}$ voltage 15. If, however, only a single input switches to "zero" (e.g., input 11), only a single transistor (e.g., transistor 21) will pull output 31 to $V_{DD}$ voltage 15. Since there is some capacitance associated with output 31, when both transistors 21 and 22 are pulling output 31, output 31 will switch faster than if either transistor 21 or 22 alone is driving output 31. Therefore, in CFET NAND gates, rise times vary as a function of the input state transitions.
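The input-pattern dependence of the NAND rise time can be approximated with a first-order RC model, in which each conducting pull-up transistor is treated as a fixed on-resistance charging the output capacitance. The sketch below is an idealization under that assumption; the resistance and capacitance values are hypothetical and do not come from FIG. 1c.

```python
import math


def rise_time_10_90(r_on_ohms, c_load_farads, n_parallel_pullups):
    """10%-90% rise time of an RC-modeled output node, in seconds.

    Assumed model: each conducting pull-up is a resistor r_on_ohms, so
    n identical pull-ups in parallel present r_on_ohms / n to the load.
    For an exponential charge-up, the 10%-90% time equals ln(9) * R * C.
    """
    if n_parallel_pullups < 1:
        raise ValueError("at least one pull-up must conduct")
    r_eff = r_on_ohms / n_parallel_pullups
    return math.log(9.0) * r_eff * c_load_farads


R_ON = 10e3      # hypothetical on-resistance of one pull-up transistor, ohms
C_LOAD = 50e-15  # hypothetical capacitance at the output node, farads

one_input_falls = rise_time_10_90(R_ON, C_LOAD, 1)    # a single pull-up conducts
both_inputs_fall = rise_time_10_90(R_ON, C_LOAD, 2)   # both pull-ups conduct
print(f"one pull-up: {one_input_falls * 1e9:.3f} ns, two pull-ups: {both_inputs_fall * 1e9:.3f} ns")
```

Under this model the rise time with a single conducting pull-up is twice the rise time with both conducting, which illustrates the kind of data-dependent delay spread described above.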
Since CMOS technology is a dominant and mature technology in the modern semiconductor industry, and has certain unique positive features for digital system design, it is necessary to attack the practical problems of unequal delays and asymmetric rise and fall times and to explore novel design techniques that are suitable for CMOS wave pipelining. Researchers have studied the basic logic-circuit issues of the CMOS wave-pipelining technique and have proposed some solutions. For instance, in [Fan:92] and [Gray:91], the basic logic circuits used are an inverter (not shown) and a two-input cross-coupled pseudo-NMOS NAND gate 40 (shown in FIG. 1d), which is formed by stacking cross-coupled n-channel transistors under a p-channel active pull-up device with bias voltage Vb. Since, however, the bias voltage Vb has to be distributed all over the wave-pipelined circuit chip, and the gate delay is sensitive to the bias-voltage value, careful routing is needed to ensure proper functioning of the circuit [Fan:92].
In an alternative approach, a balanced CMOS NAND gate (FIG. 1e) is proposed in [Klass:92] to reduce the static CMOS gate-delay variations by adding a redundant ground-biased PMOS device to "soften" the input-pattern-dependent delay variation. This approach, however, has the drawbacks of increased layout area, loading capacitance, gate delays and dynamic power dissipation.
Klass [Klass:93a] describes a wave-pipelining circuit using standard CMOS logic gates. In [Klass:93b] and [Klass:93a], a conventional static CMOS NAND gate and an inverter were used as the basic circuits; however, the design was restricted to using 2-input NAND gates and inverters for every logic function, to minimize the delay sensitivity of the circuit to the input data patterns. In addition, every function block had to be verified separately to avoid large delay variations.
Each of the above approaches uses only 2-input NAND gates and inverters as the basic circuits to implement arbitrary logic functions. This constraint can lead to a large chip area, and will limit the applications of wave pipelining.
Wong [Wong:93] presents an algorithm for designing a wave-pipelining circuit with minimal area and minimal power consumption. The algorithm involves: (1) rough tuning, by adding delay elements to balance circuit paths; and (2) fine tuning, by adjusting gate drives to compensate for delay variations introduced by different "fanouts" (the number of loads; in CFET technology this is primarily the sum of the capacitive load of each gate driven by the output driver, plus the capacitance of inter-circuit wiring).
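The rough-tuning step can be illustrated with a much simpler procedure than the one reported in [Wong:93]: visit the gates in topological order and pad every early-arriving input with enough delay to match the latest input of the same gate. The Python sketch below is a simplified stand-in and not the cited algorithm. It does not minimize area or power, and the data layout, names, and example netlist are assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def rough_tune(gate_delay, fanin):
    """Balance a combinational network by padding gate inputs with delay.

    gate_delay: dict mapping every node (primary inputs included, with
                delay 0) to its own propagation delay.
    fanin:      dict mapping a node to the list of nodes driving it.
    Returns (arrival, padding): the balanced arrival time at each node's
    output, and the delay to insert on each (driver, gate) connection.
    """
    arrival, padding = {}, {}
    graph = {node: fanin.get(node, []) for node in gate_delay}
    for node in TopologicalSorter(graph).static_order():  # drivers come first
        drivers = fanin.get(node, [])
        latest = max((arrival[d] for d in drivers), default=0.0)
        for d in drivers:
            padding[(d, node)] = latest - arrival[d]  # delay element on this wire
        arrival[node] = latest + gate_delay[node]
    return arrival, padding


# Hypothetical netlist: g2 combines the output of g1 with primary input b.
delays = {"a": 0, "b": 0, "g1": 2, "g2": 1}
drivers = {"g1": ["a"], "g2": ["g1", "b"]}
arrival, padding = rough_tune(delays, drivers)
print(arrival["g2"], padding[("b", "g2")])
```

In this example, input b reaches g2 two time units before the output of g1 does, so the sketch inserts two units of delay on that connection and leaves the g1 connection unpadded.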
Other FET logic families have also been explored. For instance, Complementary Pass-transistor Logic ("CPL") has proven to be a high-speed, area-efficient, and low-power technique [Yano:90][Weste:93][Shimohigashi:93]. FIG. 1f shows an example of a basic prior-art CPL logic circuit 60 [Yano:90]. In the embodiment shown in FIG. 1f, the same circuit is used to implement AND, NAND, OR, and NOR functions; the function is determined by selection of the signals provided at the circuit inputs. The design method presented by Yano et al. [Yano:90] had no p-channel transistor in the pass network. Dual input signals and n-channel pass-transistors were used to implement dual-output gate circuits.
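The idea that one pass-transistor cell yields different functions depending only on which signals are wired to its inputs can be modeled with an idealized Boolean sketch. The model below treats the pass network as a dual-rail two-way selector and ignores electrical effects such as threshold-voltage drop and output buffering. The cell interface and the particular input assignments are assumptions for illustration, since FIG. 1f is not reproduced here.

```python
def cpl_cell(x, y, select):
    """Idealized dual-rail pass network.

    When select is true, input x is passed to the output; otherwise y is
    passed. The complementary rail passes the inverted inputs the same way.
    Returns (out, out_bar).
    """
    out = x if select else y
    out_bar = (not x) if select else (not y)
    return bool(out), bool(out_bar)


def cpl_and_nand(a, b):
    # Pass a when b is 1; pass b (which is 0) when b is 0.
    return cpl_cell(a, b, select=b)


def cpl_or_nor(a, b):
    # Pass b (which is 1) when b is 1; pass a when b is 0.
    return cpl_cell(b, a, select=b)


for a in (False, True):
    for b in (False, True):
        and_out, nand_out = cpl_and_nand(a, b)
        or_out, nor_out = cpl_or_nor(a, b)
        print(int(a), int(b), int(and_out), int(nand_out), int(or_out), int(nor_out))
```

The same `cpl_cell` function serves all four functions; only the assignment of signals to its inputs changes.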
The circuit shown in FIG. 1f does have drawbacks. Circuit 60 does not make efficient transitions with respect to logic-high input signals because of the poor "one" conduction problem of the NMOS pass-transistors (the maximum voltage for logic "one" is bounded by $V_{DD} - V_T$). Yano et al. [Yano:90] therefore utilized a specific fabrication technology, in which NMOS pass-transistors 62 were designed to have a zero threshold voltage ($V_T = 0$ volts), whereas the other NMOS and PMOS transistors had $V_T = \pm 0.4$ volts, respectively. With this design method, the quality of the logic-high is indeed improved, but noise immunity and reliability are reduced. In addition, the special fabrication requirements limit its wide application.
None of the above methods appear to teach how to design a family of field-effect-transistor-based circuits which provide substantially equal delays regardless of the pattern of the input logic-state transitions, and which provide a high-quality logic one as well as a high-quality logic zero.