1. Field of the Invention
The invention pertains to the field of integrated circuits. More particularly, the invention pertains to field programmable gate array integrated circuit devices.
2. Description of Related Art
Field Programmable Gate Array (FPGA) integrated circuit devices are known in the art. An FPGA comprises any number of initially uncommitted logic modules arranged in an array along with an appropriate amount of initially uncommitted routing resources. Logic modules are circuits which can be configured to perform a variety of logic functions like, for example, AND-gates, OR-gates, NAND-gates, NOR-gates, XOR-gates, XNOR-gates, inverters, multiplexers, adders, latches, and flip/flops. Routing resources can include a mix of components such as wires, switches, multiplexers, and buffers. Logic modules, routing resources, and other features like, for example, I/O buffers and memory blocks, are the programmable elements of the FPGA.
The programmable elements have associated control elements (sometimes known as programming bits or configuration bits) which determine their functionality. The control elements may be thought of as binary bits having values such as on/off, conductive/non-conductive, true/false, or logic-1/logic-0 depending on the context. The control elements vary according to the technology employed and their mode of data storage may be either volatile or non-volatile. Volatile control elements, such as SRAM bits, lose their programming data when the PLD power supply is disconnected, disabled or turned off. Non-volatile control elements, such as antifuses and floating gate transistors, do not lose their programming data when the PLD power supply is removed. Some control elements, such as antifuses, can be programmed only one time and cannot be erased. Other control elements, such as SRAM bits and floating gate transistors, can have their programming data erased and may be reprogrammed many times. The detailed circuit implementation of the logic modules and routing resources can vary greatly and must be appropriate for the type of control element used.
Typically a user creates a logic design inside manufacturer-supplied design software. The design software then takes the completed design and converts it into the appropriate mix of configured logic modules and other programmable elements, maps them into physical locations inside the FPGA, configures the interconnect to route the signals from one logic module to another, and generates the data structure necessary to assign values to the various control elements inside the FPGA.
Many FPGA architectures employing various different logic modules and interconnect arrangements are known in the art. Some architectures are flat while others are clustered. In a flat architecture, the logic modules may or may not be grouped together with other logic modules, but all of the logic modules have free access to the larger routing architecture.
In a clustered architecture, the logic modules are grouped together into clusters which typically have a two level hierarchy of routing resources associated with them. The first level typically makes interconnections internal to the cluster while the second level typically allows interconnections between clusters. FIG. 1 illustrates a block diagram of a prior art logic cluster which illustrates the basic principles of a clustered architecture. The logic cluster contains four logic modules each comprising a logic function generator circuit of a type sometimes called a look-up table (or LUT) each having four inputs which are designated LUT4 in the diagram. Each LUT4 has an associated flip/flop designated FF. The output of each LUT4 is coupled to the data input of the associated flip/flop. The output of each LUT4 and each flip/flop is coupled to the block designated Cluster Internal Routing Lines which is the first level of the routing hierarchy. The output of each LUT4 and each flip/flop is also coupled to the block designated External Horizontal & Vertical Routing Lines which is the second level of the routing hierarchy.
In the architecture of FIG. 1, signals are transmitted from the second level of the architecture to the first level by means of the ten Cluster Input Multiplexers coupled between the External Horizontal & Vertical Routing Lines and the Cluster Internal Routing Lines. Various lines and resources from other parts of the FPGA are connected to the inputs of the Cluster Input Multiplexers by means of the External Horizontal & Vertical Routing Lines. The lines internal to the Cluster Internal Routing Lines block come from a variety of sources: the outputs of the Cluster Input Multiplexers, the outputs of the cluster's LUT4s and flip/flops, and possibly other sources such as clock networks and other special functions not shown in FIG. 1 to avoid overcomplicating the diagram.
The LUT4 Input Multiplexers in FIG. 1 are coupled between the Cluster Internal Routing Lines block and the various inputs on the LUT4 blocks. Since there are four LUT4 blocks each with four inputs, there are a total of sixteen LUT4 Input Multiplexers in the cluster. In general, the number of inputs to each LUT4 Input Multiplexer is less than the total number of lines in the Cluster Internal Routing Lines block, so each LUT4 Input Multiplexer can only transmit a subset of those signals to its associated LUT4 input.
Note that in FIG. 1 there are only ten Cluster Input Multiplexers while there are sixteen LUT4 inputs. This places certain restrictions on the place and route software tool (or tools), since in the case of FIG. 1 no sub-circuit with more than ten logic inputs can be placed in a single cluster. This restriction is the defining difference between flat and clustered FPGA architectures. FPGA designers who accept this restriction believe that the overall area required by the Cluster Input Multiplexers and the LUT4 Input Multiplexers is less than the area that would be required to have only LUT4 Input Multiplexers and eliminate the first level of the routing hierarchy. In a clustered architecture, the less numerous Cluster Input Multiplexers tend to have a large number of inputs while the more numerous LUT4 Input Multiplexers have fewer inputs. In a non-clustered architecture, the LUT4 Input Multiplexers would need many more inputs to achieve the equivalent routing capability.
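The ten-input restriction described above can be illustrated with a minimal sketch. It assumes a simplified model in which a sub-circuit is described only by the nets feeding each LUT4 and the nets the LUT4s drive; the function name `fits_in_cluster`, the net names, and the data layout are hypothetical and are not part of any place and route tool. Flip/flops, clock networks, and other special resources are ignored.

```python
# Values taken from the description of FIG. 1.
CLUSTER_INPUT_MUXES = 10   # signals that can enter the cluster from outside
LUTS_PER_CLUSTER = 4       # LUT4 logic modules per cluster
LUT_INPUTS = 4             # inputs per LUT4

def fits_in_cluster(lut_input_nets, lut_output_nets):
    """Hypothetical check: can this sub-circuit be placed in one cluster?

    lut_input_nets  -- list of sets; the nets feeding each LUT4
    lut_output_nets -- list of net names; the net driven by each LUT4
    """
    if len(lut_input_nets) > LUTS_PER_CLUSTER:
        return False
    if any(len(nets) > LUT_INPUTS for nets in lut_input_nets):
        return False
    # Nets produced inside the cluster are available on the internal
    # routing lines and do not consume a Cluster Input Multiplexer.
    internal = set(lut_output_nets)
    external = set().union(*lut_input_nets) - internal
    return len(external) <= CLUSTER_INPUT_MUXES

# Four LUT4s with sixteen distinct external inputs: too many for one cluster.
wide = [{f"n{4 * k + j}" for j in range(4)} for k in range(4)]
print(fits_in_cluster(wide, ["o0", "o1", "o2", "o3"]))    # False

# Four LUT4s sharing the same four external inputs: fits easily.
shared = [{"a", "b", "c", "d"} for _ in range(4)]
print(fits_in_cluster(shared, ["o0", "o1", "o2", "o3"]))  # True
```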
The determination of whether to build an FPGA in a clustered or non-clustered architecture depends on a great many factors like the cost of various silicon features, the programmable technology being employed, the familiarity of the designers with one approach or the other, and various issues related to the design software, and is beyond the scope of this disclosure. However, both architectural approaches can be found in commercial FPGAs.
One area where FPGA manufacturers typically attempt to enhance their products is computer arithmetic. This typically takes the form of adding some sort of carry circuit coupled to the logic function generator in each logic module which accepts a carry input from an adjacent logic module and propagates a carry output to a different adjacent logic module, typically on the other side so that carry chains can propagate along a row or column of the FPGA array. Efforts are generally directed towards doing ordinary addition quickly and efficiently, since other operations such as subtraction, multiplication, and magnitude comparison can be efficiently performed by judicious use of adders.
FIG. 2A shows the logic for a full adder circuit known in the art. FIG. 2B shows the logic truth table for the full adder while FIG. 2C shows the full adder circuit used to implement a three-bit ripple adder known in the art. The full adder circuit has two operand inputs represented by Ai and Bi in the diagram and a carry input from the previous stage that is designated as Ci. The full adder circuit has a sum output designated as Si and a carry output designated Ci+1. The subscript “i” is an integer variable which represents which bit position in a binary number the full adder cell is associated with in any given adder. Traditionally i=0 for the stage associated with the least significant bit of the adder. Each of the three inputs can have a binary value of either 0 or 1. Thus if one were to add all three bits together, one could get a decimal value of either 0, 1, 2, or 3 which would be represented as 00, 01, 10, and 11 respectively in two binary bits. The Ci+1 output represents the most significant bit of the sum and the Si output represents the least significant bit.
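The behavior of the full adder described above can be sketched in a few lines of Python. This is an illustrative model only; the function names are hypothetical, and the gate-level form chosen for the carry output is one common realization that is not necessarily the exact circuit of FIG. 2A.

```python
def full_adder(a, b, c_in):
    """Return (s, c_out) for one-bit operands a, b and carry input c_in.

    s is the least significant bit of a + b + c_in and c_out is the
    most significant bit, matching the roles of Si and Ci+1 in the text.
    """
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out

def truth_table():
    """Enumerate all eight input combinations as (a, b, c_in, s, c_out)."""
    return [(a, b, c, *full_adder(a, b, c))
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]

for a, b, c, s, c_out in truth_table():
    # The two output bits together encode the arithmetic sum 0..3.
    print(f"A={a} B={b} C={c} -> C_out={c_out} S={s}  (sum={a + b + c})")
```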
In the ripple adder of FIG. 2C, the full adders of the type shown in FIG. 2A are shown in a series carry arrangement. This means that depending on the operands A2-A0 and B2-B0, it is possible for a carry input signal to enter via C0, the carry input to the least significant bit of the adder, and propagate through the adder cells until reaching C3, the carry output of the most significant bit of the adder. For every stage (or bit position) the Ci+1 output becomes the Ci input of the next stage. For example, the carry output of the middle stage (called stage 1 because i=1 for all of the inputs) is designated C2 (where i+1=2) and becomes the Ci input of stage 2 (where i=2). This is analogous to humans doing decimal arithmetic. When two decimal digits are added together the result is between 0 and 19 if there is a carry in from the previous digit (because 9+9+1=19, which is the maximum sum for a single digit position). If the sum for that digit is between 0 and 9 that is the value for that digit and the addition operation continues to the next significant digit; if the answer is between 10 and 19, the value for the current digit is the least significant digit of the sum and a 1 is carried (i.e., added) to the next digit (which has a value 10 times bigger than the current digit, so only a 1 and not a 10 is carried). In the full adder circuit, the Ci+1 signal represents a value of “2” in the stage where it is generated but only represents a “1” in the next stage because the bit in that stage has a binary weight of twice the previous stage.
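The series carry arrangement can be modeled as a loop in which each stage's carry output feeds the next stage's carry input. The sketch below is a behavioral illustration, not a description of any particular circuit; the helper names `to_bits` and `from_bits` are hypothetical, and bit lists are assumed to be ordered least significant bit first.

```python
def full_adder(a, b, c_in):
    """One-bit full adder returning (sum_bit, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out

def ripple_add(a_bits, b_bits, c0=0):
    """Add two equal-length bit lists (LSB first) with carry input c0.

    Returns (sum_bits, carry_out). The carry ripples from stage to stage:
    the Ci+1 output of stage i becomes the Ci input of stage i+1.
    """
    carry = c0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry

def to_bits(value, width):
    """Convert an integer to a list of bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Convert a list of bits (LSB first) back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))

# 5 + 3 = 8 does not fit in three bits, so the carry output C3 is 1.
print(ripple_add(to_bits(5, 3), to_bits(3, 3)))  # ([0, 0, 0], 1)
```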
The ripple carry adder of FIG. 2C is often the least expensive in terms of silicon area to implement in hardware, but it has the disadvantage of being slow when wide numbers with lots of bits are being added because the speed limiter is the time it takes for the carry signal to propagate from the least significant bit to the most significant bit. This has prompted computer designers to look for alternative approaches which can add numbers faster than a ripple adder of the same width can.
One such attempt is the carry-select adder shown in FIG. 3A, which is known in the art. The technique involves doing the addition twice for each section of the adder: once assuming that the carry in equals 0 and once assuming the carry in equals 1. The carry input signal then goes to the select input of a multiplexer which selects the correct sum and carry outputs from the correct adder and presents them to the adder outputs. While this approach is slower for a single stage like that shown in FIG. 3A, a multistage adder constructed this way like the one shown in FIG. 3B with the C3 output of one stage coupled to the C0 input of the next greatly enhances performance because the worst case delay of each additional stage bypasses the adders and only involves the delay from the multiplexer select input to its output. In FIG. 3B only the carry multiplexers are shown to illustrate that the critical path passes from C4 to C8 to C12 and C16, which completely bypasses the adders in all of the high order stages. The cost of this approach is an adder that takes roughly twice the area to implement since twice as many adder bits are required.
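A behavioral sketch of the carry-select technique follows. It assumes four-bit sections and a sixteen-bit total width, chosen only to match the C4, C8, C12, and C16 carry names mentioned above; the function names and parameters are hypothetical. Software executes the steps sequentially, so the sketch shows the logical structure and not the speed advantage.

```python
def ripple_add(a_bits, b_bits, c_in):
    """Plain ripple addition of two bit lists (LSB first)."""
    carry = c_in
    sums = []
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sums, carry

def carry_select_section(a_bits, b_bits, c_in):
    """Compute the section twice and let c_in select the correct result."""
    result_if_0 = ripple_add(a_bits, b_bits, 0)  # assumes carry in = 0
    result_if_1 = ripple_add(a_bits, b_bits, 1)  # assumes carry in = 1
    # This selection plays the role of the multiplexer in the text.
    return result_if_1 if c_in else result_if_0

def carry_select_add(a, b, width=16, section=4, c_in=0):
    """Add integers a and b; width is assumed to be a multiple of section."""
    carry = c_in
    total = 0
    for low in range(0, width, section):
        a_bits = [(a >> (low + i)) & 1 for i in range(section)]
        b_bits = [(b >> (low + i)) & 1 for i in range(section)]
        sums, carry = carry_select_section(a_bits, b_bits, carry)
        total |= sum(bit << (low + i) for i, bit in enumerate(sums))
    return total, carry

print(carry_select_add(0x0FFF, 0x0001))  # (4096, 0)
```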
Another attempt known in the art is the carry-look-ahead adder shown in FIGS. 4A, 4B and 4C. The basic adder cell is shown in FIG. 4A and the truth table is shown in FIG. 4B. Its key feature is that it has no carry in signal, no carry out signal, and no sum out signal. Instead, they have been replaced with two outputs Gi and Pi. The Gi signal is known as the carry-generate signal. It has a value of logic one if a carry out equal to logic one will occur in a full adder like the one shown in FIG. 2A as a result solely of Ai and Bi. This can only occur if both Ai and Bi equal logic one, since this will result in a sum of either 2 or 3 depending on the carry input. Thus the logic equation for the carry-generate signal is Gi=Ai AND Bi.
The Pi signal is known as the carry-propagate signal. It has a value of logic one if the carry signal would propagate from Ci to Ci+1 in a full adder like the one shown in FIG. 2A. This can only occur when one and only one of Ai or Bi equals logic one. Thus the logic equation for the carry-propagate signal is Pi=Ai XOR Bi.
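The two equations Gi=Ai AND Bi and Pi=Ai XOR Bi can be checked against the full adder's carry output with a short sketch. The function names are hypothetical; the check relies on the fact that a carry out occurs exactly when a carry is generated, or is propagated from a carry input of logic one.

```python
def cla_cell(a, b):
    """Basic carry look-ahead cell: return (G, P) for one bit position."""
    g = a & b   # carry-generate: both operand bits are 1
    p = a ^ b   # carry-propagate: exactly one operand bit is 1
    return g, p

def carry_from_gp(a, b, c_in):
    """Carry out reconstructed from G and P: C_out = G OR (P AND C_in)."""
    g, p = cla_cell(a, b)
    return g | (p & c_in)

for a in (0, 1):
    for b in (0, 1):
        g, p = cla_cell(a, b)
        print(f"A={a} B={b} -> G={g} P={p}")
```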
Shown in FIG. 4C is a complete three-bit carry look-ahead adder. On the left are the basic cells for each bit and on the right is the necessary logic to implement the adder based upon the outputs of the basic cells. For each bit position, the equation for the sum output Si is Si=Pi XOR Ci=(Ai XOR Bi) XOR Ci, which is equivalent to the full adder logic in FIG. 2A.
The key feature for the carry logic is that the carry input Ci is generated for all stages simultaneously as a logical function of all the Gi signals, all the Pi signals, and the first stage carry input C0. Thus for very wide adders, the carry for each stage will propagate with the same number of gate delays for all bit positions, making for a very fast adder at the cost of a significant amount of logic.
The carry out signal for the first stage C1 will equal logic one if either a carry is generated in the first stage (i.e., G0=1) or if a carry is propagated from C0 through the first stage to C1 (i.e., P0 AND C0=1). Thus the logic equation is C1=G0 OR (P0 AND C0). The second stage is more complicated because there are more cases. The carry out signal for the second stage C2 will equal logic one if a carry is generated in the second stage, if a carry is generated in the first stage and propagated through the second stage, or if a carry is propagated from C0 through the first and second stages to C2. Thus the logic equation is C2=G1 OR (G0 AND P1) OR (C0 AND P0 AND P1). A similar line of reasoning applies to the carry output of the third stage in FIG. 4C and all subsequent stages in wider carry look-ahead adders.
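The carry equations above can be exercised with a small behavioral model of a three-bit carry look-ahead adder. The expressions for C1 and C2 are taken directly from the text; the expression for C3 is an assumption obtained by extending the same line of reasoning one more stage, and the function name is hypothetical.

```python
def cla_add3(a_bits, b_bits, c0=0):
    """Three-bit carry look-ahead addition on bit lists (LSB first).

    Returns (sum_bits, c3). Every carry is written directly in terms of
    the G and P signals and C0, with no dependence on another carry.
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]  # carry-generate signals
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # carry-propagate signals

    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])
    # Assumed extension of the same pattern to the third stage.
    c3 = (g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2])
          | (c0 & p[0] & p[1] & p[2]))

    carries = [c0, c1, c2]
    sum_bits = [p[i] ^ carries[i] for i in range(3)]  # Si = Pi XOR Ci
    return sum_bits, c3

# 6 + 3 = 9: sum bits 001 (LSB first: 1, 0, 0) with carry out 1.
print(cla_add3([0, 1, 1], [1, 1, 0]))  # ([1, 0, 0], 1)
```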
Shown in FIG. 5 is a three bit carry-skip adder known in the art. This approach shares elements of the ripple adder of FIGS. 2A, 2B and 2C, the carry-select adder of FIGS. 3A and 3B, and the carry look-ahead adder of FIGS. 4A, 4B and 4C. Internal to the adder, the carry for individual bit positions is generated like a ripple adder for economy, a carry-propagate signal is generated for each stage, and the logical AND of all the carry-propagate bits is used to select between the carry input signal to the adder and the output of the internal ripple carry chain. Like the carry-select adder, when multiple stages are placed in a series carry arrangement, the delay of the second and subsequent stages is only the multiplexer delay since the internal adder logic is bypassed.
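The carry-skip arrangement can be sketched behaviorally as a ripple chain plus a multiplexer controlled by the AND of the carry-propagate bits. The function name is hypothetical and the model is an illustration under that assumption, not a transcription of FIG. 5. In software the skip path and the ripple path always give the same carry value; the benefit in hardware is only that the skip path is faster.

```python
def carry_skip_add3(a_bits, b_bits, c_in=0):
    """Three-bit carry-skip addition on bit lists (LSB first).

    Returns (sum_bits, c_out).
    """
    carry = c_in
    sum_bits = []
    propagates = []
    for a, b in zip(a_bits, b_bits):
        p = a ^ b                         # carry-propagate for this bit
        sum_bits.append(p ^ carry)
        carry = (a & b) | (p & carry)     # internal ripple carry chain
        propagates.append(p)

    # Logical AND of all carry-propagate bits drives the multiplexer select.
    skip = all(propagates)
    # If every bit propagates, the carry input is passed straight through;
    # otherwise the output of the internal ripple chain is used.
    c_out = c_in if skip else carry
    return sum_bits, c_out

# 5 + 2 = 7: every bit position propagates, so c_out simply equals c_in.
print(carry_skip_add3([1, 0, 1], [0, 1, 0], 1))  # ([0, 0, 0], 1)
```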
Because the carry-skip adder inherently has a nice balance of economy and performance, variations of it have been used in a number of FPGA architectures, both flat and clustered. In clustered architectures, there has historically been a limitation on the placement of adders in the clusters. Typically the cluster contains at most two carry-skip stages, and the least significant bit of an adder is restricted to being placed in the module where the carry input first enters the carry-skip stage. Like any irregularity in an FPGA architecture, giving some logic modules unique functionality relative to other logic modules creates a non-homogeneity that substantially complicates the implementation of the design software, particularly the place and route tool (or tools). The goal of the present invention is to eliminate the non-homogeneity issues associated with the use of adders in clustered FPGA architectures of the prior art.