A synchronous digital circuit comprises combinatorial logic circuitry and register elements. The register elements, such as flip-flops responsive to a rising or falling edge of a periodic clock signal, store the current logic states. The combinatorial logic circuitry computes new logic states from the current logic states, and the computed logic states are stored to the registers on the next clock edge.
FIG. 1 shows a generalized prior art single clock synchronous circuit 100, where a clock source 110 drives individual register clocks via a clock distribution network 120. Clock distribution network 120 is ordinarily designed to minimize clock skew, which may be defined as the variation in the propagation delays from the clock source 110 to the clock inputs of the registers 140, 150, etc. In order to clock all the registers simultaneously, the propagation delays between the clock source 110 and the clock input of each register 140, 150, etc., denoted as X1 through Xn, must be exactly the same. Between successive register stages is combinatorial logic circuitry 130 with a lumped signal propagation delay denoted as Dij, which is the sum of the clock-to-Q delay of REGi 140 and the propagation delay from an output of REGi 140 to a data input of REGj 150 through the input 131 and the output 132 of the combinatorial logic circuitry 130. For simplicity, FIG. 1 illustrates a simple combinatorial delay path from the output of register 140 to the data input of register 150; however, any synchronous input of a register, such as clock enable or synchronous set/reset, could be an end point of the delay path. In general, the value of Dij varies between MINij and MAXij, its minimum and maximum delay values, respectively. The propagation delay between an input 131 and an output 132 of the combinatorial logic circuitry 130 may vary when there are multiple signal paths between the input 131 and the output 132. Variations in manufacturing process, operating voltage and temperature also affect the delay values.
To ensure correct circuit operation for each register, the following two constraints must hold for every combinatorial path from REGi to REGj with propagation delay (MINij, MAXij):
Xi + MINij ≧ Xj + HOLDj   (CONS-1)
Xi + MAXij + SETUPj ≦ Xj + P   (CONS-2)
where P denotes the operating clock period, HOLDj and SETUPj correspond to the hold and setup times of REGj, respectively, and Xi and Xj are the individual clock delays applied to REGi and REGj, respectively. The first constraint CONS-1 ensures that the output of REGi generated by a clock edge arrives at REGj no sooner than HOLDj amount of time after the latest possible arrival of the same clock edge. The second constraint CONS-2 ensures that the output of REGi generated by a clock edge arrives at REGj no later than SETUPj amount of time before the earliest arrival of the next clock edge with period P.
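The two constraints can be checked mechanically for a given set of clock delays. The following is a minimal sketch, not part of the cited prior art; the `Path` record, the function name `check_constraints`, and all delay values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One combinatorial path from REGi to REGj (hypothetical record type)."""
    i: int
    j: int
    min_delay: float  # MINij
    max_delay: float  # MAXij


def check_constraints(paths, x, setup, hold, period):
    """Return (constraint, i, j) for every violated CONS-1 / CONS-2."""
    violations = []
    for p in paths:
        # CONS-1 (hold): Xi + MINij >= Xj + HOLDj
        if x[p.i] + p.min_delay < x[p.j] + hold[p.j]:
            violations.append(("CONS-1", p.i, p.j))
        # CONS-2 (setup): Xi + MAXij + SETUPj <= Xj + P
        if x[p.i] + p.max_delay + setup[p.j] > x[p.j] + period:
            violations.append(("CONS-2", p.i, p.j))
    return violations


# Illustrative two-register loop; all numbers are made up.
paths = [Path(0, 1, 2.0, 10.0), Path(1, 0, 1.0, 4.0)]
setup = [0.5, 0.5]
hold = [0.2, 0.2]
zero_skew = [0.0, 0.0]

ok = check_constraints(paths, zero_skew, setup, hold, period=10.5)
too_fast = check_constraints(paths, zero_skew, setup, hold, period=9.0)
print(ok)        # no violations at the zero-skew period
print(too_fast)  # the long path 0 -> 1 fails its setup check
```

With zero skew, the long path from register 0 to register 1 limits the period; shortening the period below that path's delay plus setup time produces a CONS-2 violation.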
Clock skew optimization, often referred to as “cycle-stealing”, is a technique to minimize the clock period of a synchronous circuit by adjusting the path delays of the clock signal from the clock source to the clock input pin of individual register elements. The clock skew optimization problem was first formalized in the journal article entitled, “Clock Skew Optimization,” by J. P. Fishburn, which appeared in “IEEE Transactions on Computers,” pp. 945-951, July 1990. According to this article, if we consider P and Xi as unknown variables, then the problem of minimizing P, while satisfying the constraints CONS-1 and CONS-2 for every pair of registers REGi and REGj, is formulated as the following linear program:
CLOCK_SKEW_OPTIMIZATION (P)
Minimize P
subject to
Xj − Xi ≦ MINij − HOLDj   (EQN-1)
Xj − Xi + P ≧ MAXij + SETUPj   (EQN-2)
for 1≦i≦n, 1≦j≦n, where n is the number of registers in the same clock domain.
The above clock skew optimization problem can be efficiently solved by the Bellman-Ford algorithm coupled with a binary search, as described in the conference paper entitled, “A Graph-theoretic Approach to Clock Skew Optimization,” by R. B. Deokar and S. S. Sapatnekar, Proc. ISCAS, pp. 1407-1410, 1994. The Bellman-Ford algorithm is described in the textbook entitled, “Introduction to Algorithms,” by T. H. Cormen et al., pp. 532-543, MIT Press, 1993.
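The general idea of combining Bellman-Ford with a binary search can be sketched as follows. This is an illustrative reading of the technique, not a reproduction of the cited paper: for a trial period P, EQN-1 and EQN-2 become difference constraints on the Xi, which are feasible exactly when the corresponding constraint graph has no negative cycle. The function names, the tuple layout of `paths`, and the numeric values are assumptions made for this example, and the sketch further assumes that MINij ≧ HOLDj for every path so that the zero-skew period is a valid upper bound for the search.

```python
def feasible_skews(n, paths, setup, hold, period):
    """Bellman-Ford on the constraint graph; returns skews X or None."""
    edges = []
    for i, j, dmin, dmax in paths:
        # EQN-1: Xj - Xi <= MINij - HOLDj  ->  edge i -> j
        edges.append((i, j, dmin - hold[j]))
        # EQN-2: Xi - Xj <= P - MAXij - SETUPj  ->  edge j -> i
        edges.append((j, i, period - dmax - setup[j]))
    dist = [0.0] * n  # virtual source at distance 0 from every register
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist  # all constraints satisfied by X = dist
    return None  # still relaxing after n passes: negative cycle


def min_period(n, paths, setup, hold, tol=1e-6):
    """Binary search for the smallest feasible period (within tol)."""
    lo = 0.0
    hi = max(dmax + setup[j] for _, j, _, dmax in paths)  # zero-skew period
    if feasible_skews(n, paths, setup, hold, hi) is None:
        raise ValueError("sketch assumes MINij >= HOLDj for every path")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible_skews(n, paths, setup, hold, mid) is None:
            lo = mid
        else:
            hi = mid
    return hi, feasible_skews(n, paths, setup, hold, hi)


# Illustrative data: (i, j, MINij, MAXij); all numbers are made up.
paths = [(0, 1, 2.0, 10.0), (1, 0, 1.0, 4.0)]
setup = [0.5, 0.5]
hold = [0.2, 0.2]
period, skews = min_period(2, paths, setup, hold)
print(round(period, 3), skews)
```

The returned skews are shortest-path distances and are therefore relative values; adding the same constant to every Xi leaves both EQN-1 and EQN-2 unchanged, so they can be shifted to non-negative clock delays.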
For the zero clock skew case, where Xi equals Xj for all i and j, EQN-2 reduces to P≧MAXij+SETUPj for every path, meaning that the minimum clock period P is equal to the largest sum of combinatorial logic path delay and register setup time, which is often called the “critical path delay”. In this case, the minimum period P is just a feasible solution, not necessarily an optimal solution. In general, it is possible to achieve an optimal clock period that is smaller than the critical path delay by utilizing non-zero clock skew computed by solving the previously described optimization problem.
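As a hedged illustration of this point, consider a hypothetical loop of two registers with MAX12 = MIN12 = 10, MAX21 = MIN21 = 4, and setup and hold times taken as zero. These numbers are assumptions chosen only to keep the arithmetic simple.

```latex
\begin{align*}
\text{Zero skew } (X_1 = X_2):\quad & P \ge \max(10,\ 4) = 10 \\
\text{EQN-2 for path } (1,2):\quad & X_2 - X_1 + P \ge 10 \\
\text{EQN-2 for path } (2,1):\quad & X_1 - X_2 + P \ge 4 \\
\text{Sum of the two:}\quad & 2P \ge 14 \;\Rightarrow\; P \ge 7 \\
\text{Choose } X_2 - X_1 = 3,\ P = 7:\quad & 3 + 7 \ge 10, \qquad -3 + 7 \ge 4 \\
\text{EQN-1 check:}\quad & X_2 - X_1 = 3 \le 10, \qquad X_1 - X_2 = -3 \le 4
\end{align*}
```

In this example, delaying the clock of the second register by 3 time units relative to the first lets the long path borrow time from the short one, reducing the period from the critical path delay of 10 to 7.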
FIG. 2 shows a prior art programmable logic fabric 250, the internal architecture of a programmable logic device such as an FPGA (Field Programmable Gate Array), in which a custom circuit is implemented. The fabric comprises a plurality of configurable logic blocks 230, configurable input/output (I/O) blocks 210, the routing network 220 and the configuration memory 240.
The configuration data stored in configuration memory 240 define the desired functionality of the embedded logic elements of the configurable logic blocks 230 and I/O blocks 210, and generally also turn switch elements of the routing network 220 on or off to interconnect those blocks. The configuration memory 240 can be any type of storage device, including static RAM (SRAM), EPROM, EEPROM, flash memory, fuse, anti-fuse, or mask programmable metallization such as a via. User designed custom circuits are implemented by programming the configuration memory 240 of the fabric 250. The content of the configuration memory, or configuration data, is typically created starting from a design description, usually written in an HDL (Hardware Description Language) 260. The description is read by a design implementation tool 270, which comprises a set of software applications such as a synthesizer, mapper, placer, router and bitstream generator, and which generates configuration data 280. The configuration data 280 is written into configuration memory 240 of the programmable logic fabric 250, optionally using programmer 290. A typical prior art design implementation tool chain is shown as the steps of 810 in FIG. 8. The behavior of a custom design to be implemented in a programmable logic fabric is described in an HDL 812 such as Verilog or VHDL. The synthesizer takes the description along with the design constraints 811, including timing requirements, and produces a “netlist” which describes the connectivity among the library elements modeled after the logical components embedded in the fabric, such as LUTs and registers. The netlist is then processed by the mapper, which produces another type of netlist, often called a “technology mapped netlist”, representing the connectivity among architecture specific features such as configurable logic blocks and configurable I/O blocks, by packing the library elements into these blocks. 
The functions of the library elements packed into a block and their intra-block connectivity are converted to the configuration bits representing the functionality of the block containing these library elements. The mapped netlist is passed to the placer, which assigns each block in the netlist to a specific location in the fabric. The router realizes the required interconnections among the placed blocks by selecting wire segments and switch elements within the fabric's routing resources. The bitstream generator 870 takes the routed design 815 and converts it to a bitstream file 875 that can be used to configure the programmable logic fabric. In general, the performance of the implemented circuit largely depends on the quality of the implementation software. The configuration memory 240 must be programmed with the design-specific configuration data 280 prior to logic circuit operation of the fabric.
Configurable I/O blocks 210 of programmable logic fabric 250 provide an interface between the logic fabric internals and the external circuitry through the I/O ports 251. For a stand-alone FPGA, those blocks are connected to input/output pads. For an FPGA core embedded in an ASIC (Application Specific Integrated Circuit), the I/O blocks might be connected to internal nodes of circuitry implemented in another portion of the ASIC. Configuration memory bits control the direction of I/O signal flow, the drive strength of the output buffer, signal registering and many other configurable parameters.
The routing network 220 of logic fabric 250 distributes the internal signals. FIG. 3 shows a prior art routing network 310 comprising a plurality of switch multiplexers 320, 330 connected together by a plurality of wire segments 340. The inputs of the routing network drive some inputs of the constituent switch multiplexers, and the outputs of the routing network extend from some outputs of the switch multiplexers. Typically, each input drives multiple switch multiplexers, and multiple routable paths exist between a route pair having an input and an output. The number and widths of the switch multiplexers and their connectivity patterns vary depending on the fabric architecture. The switch multiplexer 320 comprises a plurality of programmable switch elements, and the output of the switch multiplexer is usually buffered in a deep-submicron programmable logic fabric. The routing network typically incorporates wire segments of different lengths, where a short wire is used for fast local interconnect, while a long wire is used for distributing high-fanout signals traveling longer distances. An exemplary buffered switch multiplexer comprising a buffer 321 and a plurality of pass transistor switches 322, each controlled by a configuration memory bit 323, is shown in FIG. 3. Two wire segments extending from the input 311 and the output 312 of the routing network 310 can be connected or disconnected by programming a configuration memory bit 323 which controls the on-off state of switch element 322. Similarly, the input 311 and the output 313 of the routing network 310 can be connected by turning on the switch element 322 and another switch element connected to the input 331 of the switch multiplexer 330. In this case, the routed path between 311 and 313 passes through two switch elements. In general, the more switches in a routed path, the larger the delay of the path. The buffer 321 could be an inverter or a tristate buffer. 
In this “active” interconnect scheme, each routing switch connection is buffered at the output, which provides a constant interconnect delay independent of the signal fanout. This makes it easier to predict the interconnect delay during a timing-driven map, place and route process, which helps deliver better performance. However, a signal can only be driven from input to output in a buffered switch multiplexer, while it may be driven in both directions in an unbuffered switch multiplexer. Since only one switch element in a switch multiplexer can be turned on at a time, the switch elements may be controlled by encoded memory bits rather than individual memory bits. Also, a wide input switch multiplexer may be constructed from multi-level switches forming a tree structure rather than the flat, single-level switches 320 shown in FIG. 3. A typical routing network incorporates various types of switch multiplexers: buffered or unbuffered output, wide or narrow or even single input, encoded or unencoded control memory bits, and single-level or multi-level switches. Various types of prior art switch elements may be used, including a pass transistor 322, transmission gate, fuse, anti-fuse, mask programmable via/metal segment, or any type of programmable switch element known in the prior art.
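The difference between individual and encoded control memory bits can be illustrated with a short sketch. The function names are hypothetical, and the sketch assumes a single-level multiplexer in which exactly one input is always selected; a real fabric may reserve an extra code or bit for the all-off state.

```python
def one_hot_bits(num_inputs, selected):
    """Individual memory bits: one bit per switch element, one bit set."""
    return [1 if k == selected else 0 for k in range(num_inputs)]


def encoded_bits(num_inputs, selected):
    """Encoded memory bits: binary index of the selected input, LSB first."""
    width = max(1, (num_inputs - 1).bit_length())
    return [(selected >> b) & 1 for b in range(width)]


# A hypothetical 16-input switch multiplexer with input 5 turned on.
individual = one_hot_bits(16, 5)
encoded = encoded_bits(16, 5)
print(len(individual), individual)  # 16 bits, only position 5 is set
print(len(encoded), encoded)        # 4 bits encoding the value 5
```

Encoding reduces the configuration memory per multiplexer from N bits to about log2(N) bits, at the cost of decoding logic between the memory and the switch elements.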
Due to the high-fanout nature of the clock signal 110 shown in FIG. 1, most programmable logic fabrics incorporate a dedicated clock distribution network to efficiently distribute the clock signals to every register element with minimal clock skew. Modern FPGA devices contain hundreds of thousands of register elements such as 140, 150 of FIG. 1.
FIG. 4 shows a typical clock distribution network 400 employed in a prior art programmable logic fabric. The clock tree, typically comprising the root node 450, horizontal spines 451, vertical spines 452, the leaf nodes 453, and associated buffers 454, is designed in such a way that the delay from the root node 450 to each and every clock input of the register elements 460 driven by the leaf nodes 453 is equalized so as to minimize the clock skew from one register 460 to another register 460, thereby providing a substantially similar delay from one clock leaf to another. The clock source 420 may originate from an internal clock source or an external source connected to the clock input pin. In a direct distribution configuration, the clock source multiplexer 440 selects the clock source 420 directly and distributes it from the root node 450 to horizontal spines 451, which are coupled to a set of vertical spines 452, which buffer the clock signal and apply it to the registers 460 via leaf nodes 453. In this configuration, each register has the same clock delay, which is the propagation delay of the clock tree. In a PLL (Phase Locked Loop) or DLL (Delay Locked Loop) distribution configuration, the multiplexer 440 selects an output of PLL/DLL 430, which generates an internal clock synchronized to the incoming source 420 using feedback 431. In the prior art implementation of FIG. 4, one leaf node 453 feeds back to an input 431 of PLL/DLL 430 to synchronize the clock phase of the signal distributed to the registers via leaf nodes with the external clock source 420, thereby compensating for the propagation delay of the clock tree. In this manner, the plurality of register elements 460 on each of the leaf nodes 453 can be synchronized to a single incoming clock source 420. The clock source multiplexer 440 may also select a signal from the routing network 410 to distribute a clock signal derived from the clock source 420, such as a gated clock signal. 
A dedicated and balanced clock tree embedded in the programmable logic fabric allows reliable clocking of registers at synchronous points in time with minimal clock skew from one register to another. On the other hand, a clock signal routed through general purpose routing resources cannot be synchronized with the clock source and may incur a large amount of clock skew due to uneven route-dependent clock delays associated with undesirable clock signal routing paths. It is common practice to embed a plurality of clock trees into the fabric to distribute multiple minimum skew clock signals, as typical applications require.
FIG. 5 shows a configurable logic block (CLB) 501 along with the input switch matrix (ISM) 502, clock tree leaf nodes 510, and the routing network 503. The CLB 501 comprises a plurality of configurable logic elements, including configurable register element 540 and configurable combinatorial logic elements such as LUT (Look-Up Table) 530. There may be several identically structured LUTs, registers, and multiplexers, such as 531, 532, 541, 542 and 550, as well as complex logic elements such as arithmetic logic and wide-input multiplexers, not shown herein for simplicity. For a LUT-based fabric, desired combinatorial logic functions can be implemented by programming the configuration memory bits (not shown) representing the LUT contents. For example, a 4-input LUT (LUT4), which contains 16 bits of memory, can realize any 4-input, 1-output Boolean logic function by implementing a fully populated 4-input, 1-output truth table, as known in the prior art. The register element may have other control inputs, or optional ports, such as set/reset and clock enable pins, which are not shown herein for simplicity. Various modes of register 540 operation can also be configured by programming the configuration memory bits (not shown) associated with register functionality, such as the polarity of the clock edge, synchronous set/reset mode, flip-flop/latch mode, and other prior art register functions. The ISM 502 comprises an array of switch multiplexers, where the output of each multiplexer drives at least one CLB input and the inputs of each multiplexer are connected to some of the incoming wires from the leaf nodes of clock trees 504, the outputs 505 of routing network 503, the bounce-back 506, or the feed-back 525. The incoming wires to ISM 502 are represented as vertical lines in FIG. 5. The horizontal lines in ISM 502 with arrows pointing to CLB inputs correspond to the output wires of the switch multiplexers driving the CLB inputs. 
The directional switch elements 503 of the multiplexers, denoted as ‘>’ marks, are sparsely populated at the cross points between the incoming wires and the output wires of the multiplexers. The number and the locations of the switch elements on the ISM 502, which translate to the width and input connection pattern of the switch multiplexers, vary depending on the fabric architecture. Signals presented on the incoming wires to ISM 502 are routed to inputs of the CLB 501 by turning on the appropriate switch elements on the ISM 502. For example, a global clock signal distributed through leaf node 511 can be routed to the clock pin of register 540 by turning on the switch element 512 in the clock selection multiplexer 560. Similarly, a locally generated clock signal distributed through the output 521 of the routing network 503 can be routed to the register 541 by turning on the switch element 522 in the clock selection multiplexer 561. Signals can also be selected through switch multiplexers residing in the CLB. For example, the D-input signal of register 541 can be selected from either LUT4 531 or the output 551 of data input multiplexer 573 in ISM 502 through a properly configured data select multiplexer 550 in CLB 501. The switch configuration data representing the on-off state of the switches, usually created by the router, is stored in the configuration memory as described earlier. Unlike other switch multiplexers in ISM 502, the clock selection multiplexers 560, 561, 562 take inputs from the leaf nodes of clock trees 504. They may also take inputs from the routing network 505 to distribute locally generated clock signals, such as a registered clock signal generated by a sequential circuit such as a divide-by-N counter. A clock selection multiplexer may drive more than one clock input and may comprise multiple levels of switch multiplexers. 
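The LUT4 behavior described above, in which 16 configuration bits hold a fully populated truth table, can be sketched as follows. The function names and the bit ordering (input a as the least significant address bit) are assumptions made for this illustration and do not reflect any particular device.

```python
def make_lut4(func):
    """Build the 16 configuration bits of a LUT4 from a 4-input function."""
    bits = []
    for addr in range(16):
        # Input a is the least significant address bit (an assumed ordering).
        a, b, c, d = ((addr >> k) & 1 for k in range(4))
        bits.append(int(bool(func(a, b, c, d))))
    return bits


def eval_lut4(bits, a, b, c, d):
    """Look up the output: the four inputs form the memory address."""
    return bits[a | (b << 1) | (c << 2) | (d << 3)]


# Program the LUT as a 4-input XOR (parity) function.
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
print(xor4)
print(eval_lut4(xor4, 1, 0, 0, 0), eval_lut4(xor4, 1, 1, 0, 0))
```

Any other 4-input, 1-output function is realized the same way by passing a different function to `make_lut4`; only the 16 stored bits change, not the lookup logic.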
Although the embedded clock trees are dedicated to distributing high-fanout clock signals to the clock pins, it would be useful to utilize unused clock trees for routing high-fanout signals to inputs of the logic elements other than clock inputs. For example, in the Virtex-4 architecture from Xilinx Inc., the input access of the global clock lines is not limited to the clock pins of the logic resources; the leaf nodes of the global clock tree can access other inputs in the CLBs, such as LUT inputs and the set/reset inputs of the register. This may be done through the clock signal “bounce-back” structure, a 2-level multiplexer structure formed from the clock selection multiplexer 560 to the LUT input selection multiplexers 570, 571, 572 through the bounce-back wire 514. For example, a high-fanout signal distributed through the leaf node 511 can be switched to the bounce-back wire 514 by turning on switch 512 in clock selection multiplexer 560, and it can be routed to an input of each of LUT4s 530, 531 and 532 through switches 515, 526 and 527, respectively. Outputs from the embedded logic elements of the CLB 501, such as LUT4 530 and register 540, drive the inputs of routing network 503, and some outputs may feed back 525 to the ISM 502 for cascading other logic elements in the same CLB with minimal interconnect delay. For example, the output of LUT4 531 can be routed to an input of LUT4 530 through the feed-back wire 525 and switch 528, forming a fast, single-switch routing path.
To maximize the operating frequency of a circuit by adjusting the clock skew, a special apparatus for implementing adjustable clock skew and corresponding method for utilizing the apparatus are required in a programmable logic fabric.
U.S. Pat. Nos. 6,873,187 and 6,486,705 to Andrews et al. introduced fractional cycle stealing units 580 in the routing of an FPGA. The delay lines 582, 583, 584, 585 in the unit 580 have distinct delay amounts that must be pre-defined prior to chip fabrication. A modified Bellman-Ford algorithm selects a particular one of the selectable delay lines 582, 583, 584, 585 through a configurable multiplexer 581 for each of the units to increase the system performance resulting from the particular clock routing. This approach incurs significant hardware cost to implement the selectable delay lines of the units for every clock input of the register elements, and it is difficult to pre-define a good set of delay values for the selectable delay lines that covers a wide range of applications with different performance requirements.
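The effect of restricting each register to a small, pre-defined set of delay values can be illustrated by exhaustive search on a toy circuit. This brute-force sketch is not the modified Bellman-Ford algorithm of the cited patents; the function name, the delay set, and the path data are all hypothetical.

```python
from itertools import product


def best_discrete_assignment(n, paths, setup, hold, delay_options):
    """Try every per-register delay choice; return (period, delays) or None."""
    best = None
    for x in product(delay_options, repeat=n):
        # Hold check (CONS-1): Xi + MINij >= Xj + HOLDj for every path.
        if any(x[i] + dmin < x[j] + hold[j] for i, j, dmin, _ in paths):
            continue
        # Smallest period satisfying CONS-2 for this assignment.
        period = max(x[i] + dmax + setup[j] - x[j] for i, j, _, dmax in paths)
        if best is None or period < best[0]:
            best = (period, x)
    return best


# Illustrative data: (i, j, MINij, MAXij); all numbers are made up.
paths = [(0, 1, 2.0, 10.0), (1, 0, 1.0, 4.0)]
setup = [0.5, 0.5]
hold = [0.2, 0.2]
best = best_discrete_assignment(2, paths, setup, hold, delay_options=(0.0, 1.0, 2.0))
print(best)
```

For this data, the search settles on a skew of one delay step because a skew of two steps would violate the hold constraint on the path from register 0 to register 1. A different set of pre-defined delay values would yield a different achievable period, which illustrates why the choice of delay values is application dependent.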
The publication “Constrained Clock Shifting for Field Programmable Gate Array” by Singh and Brown, Proc. 10th International Symposium on Field Programmable Gate Arrays, Monterey, Calif., pp. 121-126, February 2002, utilizes unused dedicated clock networks to distribute a finite set of clock skews to every register. Clock skews are generated by phase shifting circuitry on the clock network. The advantage of this approach is minimal hardware overhead; however, it is not applicable when all the dedicated clock lines are consumed by multiple clock signals in a custom design with multiple clock domains.
The publication by C. Y. Yeh and M. Marek-Sadowska, “Skew-programmable Clock Design for FPGA and Skew-aware Placement” (Proc. 13th International Symposium on Field Programmable Gate Arrays, Monterey, Calif., pp. 33-40, February 2005), describes embedding programmable delay elements into the major branches of the clock tree, such as the buffer 454 locations in FIG. 4. The hardware overhead of this approach may be lower than that of the approach of Andrews, but it also requires a pre-defined set of fixed delay elements. Another drawback of this approach is that it requires a special placement algorithm which takes the delay-embedded clock trees into consideration.