The present invention relates to look up table (LUT) structures for programmable logic applications such as incrementers/decrementers.
In an FPGA, a complex logic design is broken down to smaller logic blocks and programmed into logic blocks provided in the FPGA. Logic blocks contain multiple smaller logic elements. Logic elements facilitate sequential and combinational logic design implementations. Combinational logic has no memory and outputs reflect a function solely of present input states. Sequential logic is implemented by inserting memory in the form of a flip-flop into the logic path to store past history. Current FPGA architectures include transistor pairs, NAND or OR gates, multiplexers, look-up-tables (LUT) and AND-OR structures in a basic logic element. In a PLD the basic logic element is labeled a macro-cell. Hereafter the terminology logic element will include both logic elements and macro-cells. Granularity of an FPGA refers to logic content in the basic logic block. Partitioned smaller blocks of a complex logic design are customized to fit into FPGA grain. In fine-grain architectures, one or a few small basic logic elements are grouped to form a basic logic block, then enclosed in a routing matrix and replicated. A fine grain logic element may contain a 2-input MUX or a 2-input LUT and a register. These offer easy logic fitting at the expense of complex routing. In course grain architectures, many larger logic elements are combined into a basic logic block with local routing. A course grain logic element may include a 4-input LUT with a register, and a logic block may include as many as 4 to 8 logic elements. The larger logic block is then replicated with a global routing matrix. Larger logic blocks make the logic fitting difficult and the routing easier. A challenge for FPGA architectures is to provide easy logic fitting (like fine grain) and maintain easy routing (like course grain). Course grain architectures are faster in logic operations and there is an increasing need in the IC industry to utilize larger logic blocks with multiple bigger LUT structures.
For sequential logic designs, the logic element may also include flip-flops. A MUX based exemplary logic element described in Ref-1 (Seals & Whapshott) is shown in FIG. 1A. The logic element has a built in D-flip-flop 105 for sequential logic implementation. In addition, elements 101, 102 and 103 are 2:1 MUX's controlled by one input signal for each MUX. Input S1 feeds into 101 and 102, while inputs S1 and S2 feed into OR gate 104, and the output from OR gate feeds into 103. Element 105 is the D-Flip-Flop receiving Preset, Clear and Clock signals. One may very easily represent the programmable MUX structure in FIG. 1A as a 2-input LUT; where A, B, C & D are LUT values, and S1, (S2+S3) are LUT inputs. Ignoring the global Preset & Clear signals, eight inputs feed into the logic block, and one output leaves the logic block. All 2-input, all 3-input and some 4-input variable functions are realized in the logic block and latched to the D-Flip-Flop. Inputs and outputs for the Logic Element or Logic Block are selected from the programmable Routing Matrix. An exemplary routing matrix containing logic elements as described in Ref-1 is shown in FIG. 1B. Each logic element 112 is as shown in FIG. 1A. The 8 inputs and 1 output from logic element 112 in FIG. 1B are routed to 22 horizontal and 12 vertical interconnect wires that have programmable via connections 110. These connections 110 may be anti-fuses or pass-gate transistors controlled by SRAM memory elements. The user selects how the wires are connected during the design phase, and programs the connections in the field. FPGA architectures for various commercially available FPGA devices are discussed in Ref-1 (Seals & Whapshott) and Ref-2 (Sharma).
Logic implementation in logic elements is achieved by converting a logic equation or a truth table to a gate realization. The gate level description comprising elements and nets is also called a netlist. The resulting logic gates are ported to LUT or MUX structure in the logic element. An exemplary truth table and a plurality of transistor gate realizations are shown in FIG. 2. In FIG. 2A, a truth table of 4 input variables, A, B, C & D is shown. By grouping the logic ones in the table, the output function can be expressed as AND & OR functions of inputs as shown by the logic equation in FIG. 2A. An exemplary MUX implementation of the logic function is shown in FIG. 2B. The MUX has 3-control variables A, B and C, and the fourth variable D together with D′ (not D), logic one and logic zero are used as inputs to the MUX. The inputs can be hard-wired or provided as programmable options. The MUX comprises a plurality of pass-gates 201. For a 3-variable hard-wired MUX, only 14 pass-gates such as 201 are needed. This is a very efficient implementation of hard-wired logic. Any 4-variable truth table can be realized by the 3-control variable MUX as shown in FIG. 2B by wiring the input values accordingly. The inputs to a programmable MUX logic element can be provided as shown in FIG. 2C. There is considerable overhead to make the MUX inputs user programmable. In FIG. 2C, two programmable memory bits such as 202 per input are configured to couple the desired input value to I1. Combining the two figures in FIGS. 2B & 2C, one can see that a 4-input programmable MUX utilizes 62 pass-gates such as 201 and 16 memory bits such as 202. For 6T CMOS SRAM memory, each memory bit occupies 4 NMOS gates and 2 PMOS gates. Hence a programmable 4-input MUX implementation takes up 158 transistors. In anti-fuse technology, each input wire connection can be built into a programmable anti-fuse between two metal lines. That requires only decoding transistors at the end of wire segments to program the anti-fuse elements, thus saving Silicon area. Hence a programmable MUX as shown in FIG. 2B is not popular for SRAM based FPGAs, whereas it is a logical choice for anti-fuse based FPGAs.
AND/OR realization of the logic function in FIG. 2A is shown in FIG. 2D. There are five 3-input AND gates and one 5-input OR gate to generate the required F output. In full CMOS implementation, each 3-input AND is 6 transistors, while 5-input OR is 10 transistors. Hence the AND/OR gate realization in FIG. 2D takes up 40 transistors. The Silicon area is also impacted by the latch-up related N-Well rules that mandate certain spacing restrictions between NMOS and PMOS transistors. For this example, the hard-wire MUX implementation took less gates compared to the hard-wire AND/OR gate implementation, while the programmable MUX took a considerable overhead.
Commercially available FPGAs use 3-input and 4-input look up tables (LUT). The more popular 4-input LUT implementation of the truth table in FIG. 2A is shown in FIG. 2E. Any 4-input function can be implemented in FIG. 2E by setting the LUT values. In this disclosure, we will name this a 4LUT, where the word input is dropped for convenience and the number of inputs is pre-fixed to the word LUT. The 4LUT has 16 LUT values, which can be hard-wired or programmable. LUT and MUX construction of logic elements are very similar and both are commercially used in FPGA & Gate Array products as shown in Ref-1 & Ref-2. There are 30 pass-gates (such as 201) in FIG. 2E for the hard-wire 4LUT. This 30 gate 4LUT is larger than a 14 gate hard-wire MUX, but smaller than the 40 gate hard-wire AND/OR logic implementation. The 16 LUT values in the 4LUT determine the LUT function. Using 16 programmable registers such as 202 for these inputs allows the 4LUT to be user programmable. The 16 memory elements, in both programmable MUX and LUT options, utilize 96 extra transistors when implemented in 6T CMOS SRAM. Hence the programmable 4LUT with 126 transistors is more economical compared to the programmable MUX option with 158 transistors. Thus LUT logic is extensively used in SRAM based FPGAs while MUX logic is used in anti-fuse based FPGAs and Gate Arrays.
FPGA and Gate Array architectures are discussed in Carter U.S. Pat. No. 4,706,216, Freemann U.S. Pat. No. 4,870,302, ElGamal et al. U.S. Pat. No. 4,873,459, Freemann et al. U.S. Pat. No. 5,488,316 & U.S. Pat. No. 5,343,406, Trimberger et al. U.S. Pat. No. 5,844,422, Cliff et al. U.S. Pat. No. 6,134,173, Wittig et al. U.S. Pat. No. 6,208,163, Or-Bach US 2001/003428, Mendel U.S. Pat. No. 6,275,065, Lee et al. US 2001/0048320, Or-Bach U.S. Pat. No. 6,331,789, Young et al. U.S. Pat. No. 6,448,808, Sueyoshi et al. US 2003/0001615, Agrawal et al. US 2002/0186044, Sugibayashi et al. U.S. Pat. No. 6,515,511 and Pugh et al. US 2003/0085733. These patents disclose programmable MUX and programmable LUT structures to build logic elements that are user configurable. In all cases a routing block is used to provide inputs and outputs for these logic elements, while the logic element is programmed to perform a specific logic function. The routing-block is a hard-wire connection for Gate Array and Structured ASIC devices. Within a logic element, each LUT is hard-wired to a specific size, said size determined by the number of LUT inputs. This LUT is the smallest building block in the logic element and cannot be sub-divided. As an example, a smaller 2-input logic function would occupy a 4LUT, if that is the smallest element available. That leads to Silicon utilization inefficiency. Within a logic block, multiple logic elements are grouped together in a pre-defined manner. The size of the logic block determines the granularity. As manufacturing geometries shrink, the FPGA granularity gets larger, the LUT size increases and the number of LUTs per logic block has to increase. Having a large fixed LUT in the logic element further aggravates the Silicon utilization efficiency and is not flexible for next generation FPGA designs.
As the LUT structure gets large, the logic porting becomes more difficult and Silicon utilization gets more inefficient. To illustrate LUT utilization efficiency, in FIG. 3 we provide the pass-gate construction required to build 1LUT, 2LUT, 3LUT, 4LUT and 5LUT logic elements. FIG. 3A shows a 1LUT comprising of two pass-gates 301 & 302, two LUT values contained in two programmable registers 303 & 304 and one input variable “A” in true and compliment. A 1LUT is simply a 2:1 MUX selecting one of two register values. Any 1-input function such as 2:1 MUX, Logic-0, Logic-0, TRUE and INVERT can be realized by this 1LUT by programming the two LUT values. Signal A allows the LUT values in either 303 or 304 to reach output F. There is a time delay for this to occur. That is a characteristic 1LUT delay time, which is optimized by sizing the transistors 301 and 302 as needed. Faster time requires wider transistors. The symbol for 1LUT is shown in FIG. 3B, and this symbol is used to illustrate higher LUT constructions in FIG. 3C thru FIG. 3F.
A 2LUT is shown in FIG. 3C that can realize any 2-input function such as AND, NAND, OR, NOR, XOR among others. As shown in FIG. 3C, the 2LUT can be constructed by hard-wiring three 1LUTs 311, 312 & 313 as shown. This is termed a LUT cone or a LUT tree and comprises two stages. First stage has 1LUT 311 and 312 sharing a common input, while second stage has 1LUT 313. Only the 1LUTs in the first stage 311 and 312 have LUT values. LUT outputs from first stage are fed as LUT values to second stage. These are hard-wire connections. In FIG. 3C, 1LUT outputs from 311 and 312 are fed as LUT values to 1LUT 313. A 2LUT delay comprises the time taken for a LUT value in the first stage to reach F. There are now two pass-gates in series, and this delay is larger than for a 1LUT. Thus the pass-gates need to be wider to reduce the LUT delay. That increase in area and slow down in performance hurt LUT logic trees. Similarly, 3LUT, 4LUT and 5LUT constructions with 1LUTs are shown in FIG. 3D, FIG. 3E and FIG. 3F respectively. Those pass-gates have to be even wider to improve LUT delays. The 5LUT in FIG. 3F has 16 1LUTs in the first stage, 8 1LUTs in the second stage, 4 1LUTs in the third stage, 2 1 LUTs in the fourth stage and one 1 LUT in the final fifth stage. A total of 31 1 LUTs are used in FIG. 3F for the 5LUT construction. A K-LUT cone or a K-LUT tree has K-input variables, K-stages and 2K LUT values to realize a K-input function. Each stage has one common input variable. 2(K-1) outputs from first stage feed as LUT values into second stage. Consecutive LUT value reduction continues until the last stage, when only 2 LUT values feed the last stage, and one LUT output is obtained. The equivalent 1LUTs required to build a K-LUT is tabulated in FIG. 3G, and is shown to grow as (2K−1). Logic porting to K-LUT is discussed by Ahmed et al. (Ref-3) for multiple K values. They have looked at porting 20 benchmark logic designs into varying LUT sizes: 1LUT, 2LUT, 3LUT, 4LUT, 5LUT, 6LUT and 7LUT. The geometric average number of K-LUTs required for porting 20 designs, as shown in FIG. 10 in Ref-2, is tabulated in the first 2 columns of FIG. 4. As can be seen, as the size of the K-LUT increases, the total number of K-LUTs required to fit an average design decreases. In addition, FIG. 4 also lists the equivalent 1LUT per K-LUT (from FIG. 3G) in column 3, and calculates the equivalent 1LUTs required for the design in column 4. Column 4 values are obtained by multiplying values in column 2 by values in column 3. In FIG. 4, each row represents how many K-LUTs are required for an average design, and an equivalent 1LUT calculation as a measure of Silicon utilization. 2LUT implementation in row-1 needs only 12900 1LUTs, while the 7LUT implementation in row-6 needs 177800 1LUTs for the same design. The latter 7LUT has only 7.3% Silicon utilization efficiency compared to the former 2LUT. From row-3, commercially available FPGAs with 4LUTs are seen only 36.1% efficient compared to 2LUTs at fitting logic. As the LUT size gets larger, clearly a more efficient LUT circuit is needed to improve Silicon utilization in LUT based logic elements.
LUT based logic elements are used in conjunction with programmable point to point connections. Four exemplary methods of programmable point to point connections, synonymous with programmable switches, between node A and node B are shown in FIG. 5. A configuration circuit to program the connection is not shown in FIG. 5. All the patents listed under FPGA architectures use one or more of these basic programmable connections. In FIG. 5A, a conductive fuse link 510 connects A to B. It is normally connected, and passage of a high current or exposure to a laser beam will blow the conductor open. In FIG. 5B, a capacitive anti-fuse element 520 disconnects A from B. It is normally open, and passage of a high current will pop the insulator shorting the two terminals. Fuse and anti-fuse are both one time programmable due to the non-reversible nature of the change. In FIG. 5C, a pass-gate device 530 connects A to B. The gate signal S0 determines the nature of the connection, on or off. This is a non destructive change. The gate signal is generated by manipulating logic signals, or by configuration circuits that include memory. The choice of memory varies from user to user. In FIG. 5D, a floating-pass-gate device 540 connects A to B. Control gate signal S0 couples a portion of that to floating- gate. Electrons trapped in the floating gate determines an on or off state for the connection. Hot-electrons and Fowler-Nordheim tunneling are two mechanisms for injecting charge to floating-gates. When high quality insulators encapsulate the floating gate, trapped charge stays for over 10 years. These provide non-volatile memory. EPROM, EEPROM and Flash memory employ floating-gates and are non-volatile. Anti-fuse and SRAM based architectures are widely used in commercial FPGA's, while EPROM, EEPROM, anti-fuse and fuse links are widely used in commercial PLD's. Volatile SRAM memory needs no high programming voltages, is freely available in every logic process, is compatible with standard CMOS SRAM memory, lends to process and voltage scaling and has become the de-facto choice for modern day very large FPGA device construction.
LUT based logic elements are used to implement carry logic. Such logic elements and logic blocks are disclosed in U.S. Pat. No. 5,274,581; U.S. Pat. No. 5,386,156; U.S. Pat. No. 5,481,486; U.S. Pat. No. 5,815,726; US RE35,977; U.S. Pat. No. 5,926,036; U.S. Pat. No. 6,107,822; U.S. Pat. No. 6,271,680; U.S. Pat. No. 6,353,920; U.S. Pat. No. 6,807,556; U.S. Pat. No. 6,888,373; U.S. Pat. No. 6,937,064; U.S. Pat. No. 7,030,650; U.S. Pat. No. 7,061,268; U.S. Pat. No. 7,167,021; U.S. Pat. No. 7,111,214 and U.S. Pat. No. 7,193,433. Such implementations offer techniques for programmable fabrics to be configured to perform carry logic in addition to basic LUT based logic. In all cases, extra AND, XOR type gates are added to modify simple LUT structures to enable carry logic. When LUTs are not used for carry logic, which is most of the logic in programmable logic fabric, these extra gates are not used and extra Si is wasteful and costly to the user. When high input LUTs are used to compute fewer input functions, Si area is wasted. Implementation of carry logic comprises two key elements: total usage of logic blocks, and number stages for carry logic implementation. The first determines cost and the latter determines performance of carry function. For a LUT based logic block, it is further beneficial for most of the LUT gates to be occupied by the carry logic. Specifically McElvain in U.S. Pat. No. 6,807,556 discloses a method to parallelize carry chain implementation to reduce performance, but there is no improvement to cost (i.e. same number of logic elements as no-parallel implementation) as the LUT logic blocks are under-utilized. What is desirable is to achieve carry functions that add little to no extra overhead to typical non-carry based LUT logic elements, use fewer total LUT logic blocks to reduce cost and require fewer stages to improve performance over the best prior-art parallelized scheme in implementing carry logic functions in an FPGA. Specifically, a faster and cheaper incrementer/decrementer implementation (add a 1 to a data string comprising a word, subtract a 1 from a word) using LUT logic elements is highly desirable.
All commercially available high density FPGA's use SRAM memory elements. A volatile six transistor SRAM based configuration circuit is shown in FIG. 6A. The SRAM memory element can be any one of 6-transistor, 5-transistor, full CMOS, R-load or TFT PMOS load based cells to name a few. Two inverters 603 and 604 connected back to back forms the memory element. This memory element is a latch providing complementary outputs S0 and S0′. The latch can be constructed as full CMOS, R-load, PMOS load or any other. Power and ground terminals for the inverters are not shown in FIG. 6A. Access NMOS transistors 601 and 602, and access wires GA, GB, BL and BS provide the means to configure the memory element. Applying zero and one on BL and BS respectively, and raising GA and GB high enables writing zero into device 601 and one into device 602. The output S0 delivers a logic one. Applying one and zero on BL and BS respectively, and raising GA and GB high enables writing one into device 601 and zero into device 602. The output S0 delivers a logic zero. The SRAM construction may allow applying only a zero signal at BL or BS to write data into the latch. The SRAM cell may have only one access transistor 601 or 602. The SRAM latch will hold the data state as long as power is on. When the power is turned off, the SRAM bit needs to be restored to its previous state from an outside permanent memory. In the literature for programmable logic, this second non-volatile memory is also called configuration memory. Upon power up, an external or an internal CPU loads the external configuration memory to internal configuration memory locations. All of FPGA functionality is controlled by the internal configuration memory. The SRAM configuration circuit in FIG. 6A controlling logic pass-gate is illustrated in FIG. 6B. Element 650 represents the configuration circuit. The S0 output directly driven by the memory element shown in FIG. 6A drives the pass-gate 610 gate electrode. In addition to S0 output and the memory cell, power, ground, data-in and write-enable signals in 650 constitutes the SRAM configuration circuit. Write enable circuitry includes GA, GB, BL, BS signals shown in FIG. 6A.
As discussed earlier, providing programmability is a very severe transistor and cost penalty compared to hard-wired Gate Array or ASIC implementation of identical logic. A significant factor in the penalty comes from the 6-transistors required for the configuration circuits. The natural conclusion is to minimize the number of configurable bits used in the programmable logic element. This mandates constructing a hard-wired larger 6LUT or a bigger LUT for next generation FPGAs. We have shown that Silicon utilization is severely impacted with this move towards larger LUT structures in logic elements. What is desirable is to have an economical and flexible LUT macro-cell, or a macro-LUT circuit. This LUT macro-cell should efficiently implement logic functions. Both large logic functions that port to one big LUT and small logic functions that port to multiple smaller LUTs should fit easily into a LUT macro-cell. Furthermore, LUT logic packing should maximize Silicon utilization to keep programmable logic cost reasonable with other hard-wired IC manufacturing choices. The user should be able to take a synthesized netlist from an ASIC flow, typically comprising smaller logic blocks, convert this netlist to fit in the FPGA granularity, place and route logic economically and efficiently. This would make use of existing third party ASIC tools at the front-end logic design and streamline tool flow for FPGA place & routing.
For an emulation device, the cost of programmability is not the primary concern if such a device provides a migration path to a lower cost. Today an FPGA migration to a Gate Array requires a new design to ensure timing closure. A desirable migration path is to keep the timing of the original FPGA design intact. That would avoid valuable re-engineering time, opportunity costs and time to solution (TTS). Such a conversion should occur in the same base die to avoid Silicon and system re-qualification costs and implementation delays. Such a conversion should also realize an end product that is competitive with an equivalent standard cell ASIC or a Gate Array product in cost and performance. Such an FPGA device will also target applications that are cost sensitive, have short life cycles and demand high volumes.