1. Field of the Invention
The present invention relates to domino circuit topology. In particular, the present invention relates to a dual tail time borrowing multiplexer domino circuit topology using a complementary-device CMOS logic gate.
2. Background of the Related Art
Conventional microprocessors rely on several architectural and circuit techniques to maximize CPU performance, including but not limited to:
    Zero-level bypassing
    Several ALUs
    High-speed circuit techniques
Zero-level bypassing is an architectural technique to maximize the architectural performance of the CPU. In zero-level bypassing, the output data from one ALU may be the input data to any other ALU in the next cycle, and this can occur for all ALUs in parallel during one clock cycle. A physical block diagram is shown in FIG. 1. In this way, dependent instructions may be executed in consecutive clock cycles without waiting for the results of one instruction to be written back to a register file or other memory circuit.
This topology creates both timing and routing problems. Timing becomes difficult since every ALU must transmit its result to all other ALUs, some of which may be several hundred microns away. Also, each ALU must receive inputs from all other ALUs and therefore must employ a wide multiplexer to choose the correct source data. Routing is also constrained by this topology. A microprocessor may have 5 ALUs. Therefore, it requires 5 wires per bit to provide ALU-to-ALU pathways, another pathway for incoming cache data, and another pathway to provide overrides for “immediate” data, for a total of 7 pathways per bit. Five ALUs are shown in FIG. 1 as an example.
There is a fundamental speed-limiting path that exists with zero-level bypassing. The path starts at the clock of the input multiplexer and runs to the zero-level bypassing outputs. The path then proceeds from the bypassed outputs through the ALU to create a computational result. This computational result can be transmitted to the furthest ALU. The result must be transmitted before the setup time (relative to the next clock) of the furthest ALU's zero-level bypassing mux-latch. This path is fundamental in microprocessor designs and therefore high-speed circuit design techniques, such as domino circuit design, are very commonly employed to speed up this path.
The foregoing approach, while somewhat effective, is not without drawbacks. For example, domino structures exacerbate the routing problem since domino logic requires data and the logical inverse, data#, to be generated for certain ALU functions. Data and data# must be domino compatible. Therefore, fourteen pathways per bit (data and data# for each of the seven pathways) would be required in a conventional domino implementation, which is considered excessive. This many signals routed in a data path will lead to high interconnect resistance and capacitance. Thus, in very wide ALU stacks, it is generally not practical to route both data and data# from each ALU.
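The pathway counts above follow from simple arithmetic, which the short Python sketch below reproduces. The function name and parameters are hypothetical and chosen only for illustration; the figures of 5 ALUs, one cache pathway, and one “immediate” pathway are taken from the example in the text.

```python
def pathways_per_bit(num_alus, extra_sources=2, dual_rail=False):
    """Count routed pathways per data bit (illustrative helper, not from the source).

    num_alus      -- ALU-to-ALU bypass sources (5 in the example in the text)
    extra_sources -- other sources: incoming cache data and "immediate" overrides
    dual_rail     -- True when both data and data# are routed for every pathway
    """
    single_rail = num_alus + extra_sources
    return 2 * single_rail if dual_rail else single_rail


print(pathways_per_bit(5))                  # 7
print(pathways_per_bit(5, dual_rail=True))  # 14
```

Routing both data and data# doubles the count, so the pathway total grows with every added source on both rails.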
While routing single-rail data between ALUs helps global timing, the first thing the ALU must do locally (either before or after the multiplexer) is to create data# from data. Note that when data# is generated locally with a simple inverter, it is not domino-compatible. Also, since there are so many inputs, a domino multiplexer is advantageous for speed purposes. One other caveat is that one of the multiplexer's enable signals must be at a logic high state at all times.
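The enable constraint can be illustrated with a small behavioral model in Python. This is only a logical sketch under an assumption: the requirement that one enable be high at all times is read here as exactly one enable high (one-hot selection). The function name is hypothetical, and the model says nothing about domino precharge or evaluate timing.

```python
def one_hot_mux(enables, data):
    """Behavioral model of a wide multiplexer with one-hot enables (illustrative only).

    enables -- list of 0/1 enable signals, one per source
    data    -- list of 0/1 data bits, one per source
    Assumption: exactly one enable is high at any time.
    """
    if len(enables) != len(data):
        raise ValueError("enables and data must have the same length")
    if sum(enables) != 1:
        raise ValueError("exactly one enable must be high")
    # Return the data bit whose enable is asserted.
    return next(d for e, d in zip(enables, data) if e)


# Seven sources per bit, as in the example: 5 ALUs, cache data, immediate data.
selected = one_hot_mux([0, 0, 1, 0, 0, 0, 0], [0, 1, 1, 0, 1, 0, 1])
selected_bar = 1 - selected  # data# created locally, as a simple inverter would
print(selected, selected_bar)  # 1 0
```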
The foregoing constraints strip away the ability to time-borrow through a domino multiplexer. This situation creates an absolute hard timing edge (shown in FIG. 2), for which there can be no transparency, to prevent a false evaluation. In FIG. 2, the path starts in one ALU at time “t0” at the domino multiplexer with “clk→out” representing the amount of time to generate valid outputs of the multiplexer after the arrival of the clock edge. “ALU delay” is the propagation time through the ALU in addition to the delay from the output of the ALU all the way back to the input of the next ALU. The signal must complete its propagation and setup to the multiplexer input prior to the rising edge of the next clock cycle. This means the design must pay the full penalty of clock skew and jitter, which can be a high percentage of the total cycle time.
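The hard-edge constraint can be expressed as a simple timing budget, sketched below in Python. The function name and all delay values are hypothetical (in picoseconds) and are not taken from FIG. 2; the sketch only shows that, with no time borrowing, skew and jitter are subtracted directly from the available cycle time.

```python
def hard_edge_slack(cycle, clk_to_out, alu_delay, setup, skew_jitter):
    """Slack against a hard clock edge, in picoseconds (illustrative helper).

    cycle       -- clock period
    clk_to_out  -- clock edge to valid multiplexer outputs
    alu_delay   -- ALU propagation plus wire delay to the next ALU input
    setup       -- setup time at the receiving multiplexer
    skew_jitter -- combined clock skew and jitter penalty
    A negative result means the path misses the next rising edge.
    """
    return cycle - (clk_to_out + alu_delay + setup + skew_jitter)


# Hypothetical numbers chosen only to show the effect of skew and jitter.
print(hard_edge_slack(500, 60, 340, 40, 50))  # 10
print(hard_edge_slack(500, 60, 340, 40, 80))  # -20
```

In this example, an extra 30 ps of skew and jitter turns a passing path into a failing one, since none of that penalty can be absorbed by the following cycle.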
These and other disadvantages exist in conventional circuitry.