The present invention relates to look up table (LUT) structures for programmable logic applications. More specifically, it relates to programmable LUT structures capable of implementing efficient and fast carry logic.
Traditionally, application specific integrated circuit (ASIC) devices have been used in the integrated circuit (IC) industry to reduce cost, enhance performance or meet space constraints. The generic class of ASIC devices falls under a variety of sub classes such as Custom ASIC, Standard cell ASIC, Gate Array and Field Programmable Gate Array (FPGA) where the degree of user allowed customization varies. In this disclosure the word ASIC is used only in reference to Custom and Standard Cell ASICs where the designer has to incur the cost of a full fabrication mask set. The term FPGA denotes an off the shelf programmable device with no fabrication mask costs, and Gate Array denotes a device with partial mask costs to the designer. The devices FPGA include Programmable Logic Devices (PLD) and Complex Programmable Logic Devices (CPLD), while the devices Gate Array include Laser Programmable Gate Arrays (LPGA), Mask Programmable Gate Arrays (MPGA) and a new class of devices known as Structured ASIC or Structured Arrays.
The design and fabrication of ASICs can be time consuming and expensive. The customization involves a lengthy design cycle during the product definition phase and high Non Recurring Engineering (NRE) costs during manufacturing phase. In the event of finding a logic error in the custom or semi-custom ASIC during final test phase, the design and fabrication cycle has to be repeated. Such lengthy correction cycles further aggravate the time to market and engineering cost. As a result, ASICs serve only specific applications and are custom built for high volume and low cost. The high cost of masks and unpredictable device life time shipment volumes have caused ASIC design starts to fall precipitously in the IC industry. ASICs offer no device for immediate design verification, no interactive design adjustment capability, and require a full mask set for fabrication.
Gate Array customizes pre-defined modular blocks at a reduced NRE cost by designing the module connections with a software tool similar to that in ASIC. The Gate Array has an array of non programmable (or moderately programmable) functional modules fabricated on a semiconductor substrate. To interconnect these modules to a user specification, multiple layers of wires are used during design synthesis. The level of customization may be limited to a single metal layer, or single via layer, or multiple metal layers, or multiple metals and via layers. The goal is to reduce the customization cost to the user, and provide the customized product faster. As a result, the customizable layers are designed to be the top most metal and via layers of a semiconductor fabrication process. This is an inconvenient location to customize wires. The customized transistors are located at the substrate level of the Silicon. All possible connections have to come up to the top level metal. The complexity of bringing up connections is a severe constraint for these devices. Structured ASICs fall into larger module Gate Arrays. These devices have varying degrees of complexity in the structured cell and varying degrees of complexity in the custom interconnection. The absence of Silicon for design verification and design optimization results in multiple spins and lengthy design iterations to the end user. The Gate Array evaluation phase is no different to that of an ASIC. The advantage over ASIC is in a lower upfront NRE cost for the fewer customization layers, tools and labor, and the shorter time to receive the finished product. Gate Arrays offer no device for immediate design verification, no interactive design adjustment capability, and require a partial mask set for fabrication. Compared to ASICs, Gate Arrays offer a lower initial cost and a faster turn-around to debug the design. The end IC is more expensive compared to an ASIC.
In recent years there has been a move away from custom, semi-custom and Gate Array ICs toward field programmable components whose function is determined not when the integrated circuit is fabricated, but by an end user “in the field” prior to use. Off the shelf FPGA products greatly simplify the design cycle and are fully customized by the user. These products offer user-friendly software to fit custom logic into the device through programmability, and the capability to tweak and optimize designs to improve Silicon performance. Provision of this programmability is expensive in terms of Silicon real estate, but reduces design cycle time, time to solution (TTS) and upfront NRE cost to the designer. FPGAs offer the advantages of low NRE costs, fast turnaround (designs can be placed and routed on an FPGA in typically a few minutes), and low risk since designs can be easily amended late in the product design cycle. It is only for high volume production runs that there is a cost benefit in using the other two approaches. Compared to FPGA, an ASIC and Gate Array both have hard-wired logic connections, identified during the chip design phase. ASIC has no multiple logic choices and both ASIC and most Gate Arrays have no configuration memory to customize logic. This is a large chip area and a product cost saving for these approaches to design. Smaller die sizes also lead to better performance. A full custom ASIC has customized logic functions which take less gate counts compared to Gate Arrays and FPGA configurations of the same functions. Thus, an ASIC is significantly smaller, faster, cheaper and more reliable than an equivalent gate-count FPGA. A Gate Array is also smaller, faster and cheaper compared to an equivalent FPGA. The trade-off is between time-to-market (FPGA advantage) versus low cost and better reliability (ASIC advantage). A Gate Array falls in the middle with an improvement in the ASIC NRE cost at a moderate penalty to product cost and performance. The cost of Silicon real estate for programmability provided by the FPGA compared to ASIC and Gate Array contribute to a significant portion of the extra cost the user has to bear for customer re-configurability in logic functions.
In an FPGA, a complex logic design is broken down to smaller logic blocks and programmed into logic blocks provided in the FPGA. Logic blocks contain multiple smaller logic elements. Logic elements facilitates sequential and combinational logic design implementations. Combinational logic has no memory and outputs reflect a function solely of present input states. Sequential logic is implemented by inserting memory in the form of a flip-flop into the logic path to store past history. Current FPGA architectures include transistor pairs, NAND or OR gates, multiplexers, look-up-tables (LUT) and AND-OR structures in a basic logic element. In a PLD the basic logic element is labeled a macro-cell. Hereafter the terminology logic element will include both logic elements and macro-cells. Granularity of an FPGA refers to logic content in the basic logic block. Partitioned smaller blocks of a complex logic design are customized to fit into FPGA grain. In fine-grain architectures, one or a few small basic logic elements are grouped to form a basic logic block, then enclosed in a routing matrix and replicated. A fine grain logic element may contain a 2-input MUX or a 2-input LUT and a register. These offer easy logic fitting at the expense of complex routing. In course grain architectures, many larger logic elements are combined into a basic logic block with local routing. A course grain logic element may include a 4-input LUT with a register, and a logic block may include as many as 4 to 8 logic elements. The larger logic block is then replicated with a global routing matrix. Larger logic blocks make the logic fitting difficult and the routing easier. A challenge for FPGA architectures is to provide easy logic fitting (like fine grain) and maintain easy routing (like course grain). Course grain architectures are faster in logic operations and there is an increasing need in the IC industry to utilize larger logic blocks with multiple bigger LUT structures.
For sequential logic designs, the logic element may also include flip-flops. A MUX based exemplary logic element described in Ref-1 (Seals & Whapshott) is shown in FIG.-1A. The logic element has a built in D-flip-flop 105 for sequential logic implementation. In addition, elements 101, 102 and 103 are 2:1 MUX's controlled by one input signal for each MUX. Input S1 feeds into 101 and 102, while inputs S1 and S2 feeds into OR gate 104, and the output from OR gate feeds into 103. Element 105 is the D-Flip-Flop receiving Preset, Clear and Clock signals. One may very easily represent the programmable MUX structure in FIG.-1A as a 2-input LUT; where A, B, C & D are LUT values, and S1, (S2+S3) are LUT inputs. Ignoring the global Preset & Clear signals, eight inputs feed into the logic block, and one output leaves the logic block. All 2-input, all 3-input and some 4-input variable functions are realized in the logic block and latched to the D-Flip-Flop. Inputs and outputs for the Logic Element or Logic Block are selected from the programmable Routing Matrix. An exemplary routing matrix containing logic elements as described in Ref-1 is shown in FIG.-1B. Each logic element 112 is as shown in FIG.-1A. The 8 inputs and 1 output from logic element 112 in FIG.-1B are routed to 22 horizontal and 12 vertical interconnect wires that have programmable via connections 110. These connections 110 may be anti-fuses or pass-gate transistors controlled by SRAM memory elements. The user selects how the wires are connected during the design phase, and programs the connections in the field. FPGA architectures for various commercially available FPGA devices are discussed in Ref-1 (Seals & Whapshott) and Ref-2 (Sharma).
Logic implementation in logic elements is achieved by converting a logic equation or a truth table to a gate realization. The gate level description comprising elements and nets is also called a netlist. The resulting logic gates are ported to LUT or MUX structure in the logic element. An exemplary truth table and a plurality of transistor gate realizations are shown in FIG.-2. In FIG.-2A, a truth table of 4 input variables, A, B, C & D is shown. By grouping the logic ones in the table, the output function can be expressed as AND & OR functions of inputs as shown by the logic equation in FIG.-2A. An exemplary MUX implementation of the logic function is shown in FIG.-2B. The MUX has 3-control variables A, B and C, and the fourth variable D together with D′ (not D), logic one and logic zero are used as inputs to the MUX. The inputs can be hard-wired or provided as programmable options. The MUX comprises a plurality of pass-gates 201. For a 3-variable hard-wired MUX, only 14 pass-gates such as 201 are needed. This is a very efficient implementation of hard-wired logic. Any 4-variable truth table can be realized by the 3-control variable MUX as shown in FIG.-2B by wiring the input values accordingly. The inputs to a programmable MUX logic element can be provided as shown in FIG.-2C. There is considerable overhead to make the MUX inputs user programmable. In FIG.-2C, two programmable memory bits such as 202 per input are configured to couple the desired input value to I1. Combining the two figures in FIG.-2B & 2C, one can see that a 4-input programmable MUX utilizes 62 pass-gates such as 201 and 16 memory bits such as 202. For 6T CMOS SRAM memory, each memory bit occupies 4 NMOS gates and 2 PMOS gates. Hence a programmable 4-input MUX implementation takes up 158 transistors. In anti-fuse technology, each input wire connection can be built into a programmable anti-fuse between two metal lines. That requires only decoding transistors at the end of wire segments to program the anti-fuse elements, thus saving Silicon area. Hence a programmable MUX as shown in FIG.-2B is not popular for SRAM based FPGAs, whereas it is a logical choice for anti-fuse based FPGAs.
AND/OR realization of the logic function in FIG.-2A is shown in FIG.-2D. There are five 3-input AND gates and one 5-input OR gate to generate the required F output. In full CMOS implementation, each 3-input AND is 6 transistors, while 5-input OR is 10 transistors. Hence the AND/OR gate realization in FIG.-2D takes up 40 transistors. The Silicon area is also impacted by the latch-up related N-Well rules that mandate certain spacing restrictions between NMOS and PMOS transistors. For this example, the hard-wire MUX implementation took less gates compared to the hard-wire AND/OR gate implementation, while the programmable MUX took a considerable overhead.
Commercially available FPGAs use 3-input and 4-input look up tables (LUT). The more popular 4-input LUT implementation of the truth table in FIG.-2A is shown in FIG.-2E. Any 4-input function can be implemented in FIG.-2E by setting the LUT values. In this disclosure, we will name this a 4LUT, where the word input is dropped for convenience and the number of inputs is pre-fixed to the word LUT. The 4LUT has 16 LUT values, which can be hard-wired or programmable. LUT and MUX construction of logic elements are very similar and both are commercially used in FPGA & Gate Array products as shown in Ref-1 & Ref-2. There are 30 pass-gates (such as 201) in FIG.-2E for the hard-wire 4LUT. This 30 gate 4LUT is larger than a 14 gate hard-wire MUX, but smaller than the 40 gate hard-wire AND/OR logic implementation. The 16 LUT values in the 4LUT determine the LUT function. Using 16 programmable registers such as 202 for these inputs allows the 4LUT to be user programmable. The 16 memory elements, in both programmable MUX and LUT options, utilize 96 extra transistors when implemented in 6T CMOS SRAM. Hence the programmable 4LUT with 126 transistors is more economical compared to the programmable MUX option with 158 transistors. Thus LUT logic is extensively used in SRAM based FPGAs while MUX logic is used in anti-fuse based FPGAs and Gate Arrays.
FPGA and Gate Array architectures are discussed in Carter U.S. Pat. No. 4,706,216, Freemann U.S. Pat. No. 4,870,302, ElGamal et al. U.S. Pat. No. 4,873,459, Freemann et al. U.S. Pat. No. 5,488,316 & U.S. Pat. No. 5,343,406, Trimberger et al. U.S. Pat. No. 5,844,422, Cliff et al. U.S. Pat. No. 6,134,173, Wittig et al. U.S. Pat. No. 6,208,163, Or-Bach U.S. Ser. No. 2001/003428, Mendel U.S. Pat. No. 6,275,065, Lee et al. U.S. Ser. No. 2001/0048320, Or-Bach U.S. Pat. No. 6,331,789, Young et al. U.S. Pat. No. 6,448,808, Sueyoshi et al. U.S. Ser. No. 2003/0001615, Agrawal et al. U.S. Ser. No. 2002/0186044, Sugibayashi et al. U.S. Pat. No. 6,515,511 and Pugh et al. U.S. Ser. No. 2003/0085733. These patents disclose programmable MUX and programmable LUT structures to build logic elements that are user configurable. In all cases a routing block is used to provide inputs and outputs for these logic elements, while the logic element is programmed to perform a specific logic function. The routing-block is a hard-wire connection for Gate Array and Structured ASIC devices. Within a logic element, each LUT is hard-wired to a specific size, said size determined by the number of LUT inputs. This LUT is the smallest building block in the logic element and cannot be sub-divided. As an example, a smaller 2-input logic function would occupy a 4LUT, if that is the smallest element available. That leads to Silicon utilization inefficiency. Within a logic block, multiple logic elements are grouped together in a pre-defined manner. The size of the logic block determines the granularity. As manufacturing geometries shrink, the FPGA granularity gets larger, the LUT size increases and the number of LUTs per logic block has to increase. Having a large fixed LUT in the logic element further aggravates the Silicon utilization efficiency and is not flexible for next generation FPGA designs.
As the LUT structure gets large, the logic porting becomes more difficult and Silicon utilization gets more inefficient. To illustrate LUT utilization efficiency, in FIG.-3 we provide the pass-gate construction required to build 1LUT, 2LUT, 3LUT, 4LUT and 5LUT logic elements. FIG.-3A shows a 1LUT comprising of two pass-gates 301 & 302, two LUT values contained in two programmable registers 303 & 304 and one input variable “A” in true and compliment. A 1LUT is simply a 2:1 MUX selecting one of two register values. Any 1-input function such as 2:1 MUX, Logic-1, Logic-0, TRUE and INVERT can be realized by this 1LUT by programming the two LUT values. Signal A allows the LUT values in either 303 or 304 to reach output F. There is a time delay for this to occur. That is a characteristic 1LUT delay time, which is optimized by sizing the transistors 301 and 302 as needed. Faster time requires wider transistors. The symbol for 1LUT is shown in FIG.-3B, and this symbol is used to illustrate higher LUT constructions in FIG.-3C thru FIG.-3F.
A 2LUT is shown in FIG.-3C that can realize any 2-input function such as AND, NAND, OR, NOR, XOR among others. As shown in FIG.-3C, the 2LUT can be constructed by hard-wiring three 1LUTs 311, 312 & 313 as shown. This is termed a LUT cone or a LUT tree and comprises two stages. First stage has 1LUT 311 and 312 sharing a common input, while second stage has 1LUT 313. Only the 1LUTs in the first stage 311 and 312 have LUT values. LUT outputs from first stage are fed as LUT values to second stage. These are hard-wire connections. In FIG.-3C, 1LUT outputs from 311 and 312 are fed as LUT values to 1LUT 313. A 2LUT delay comprises the time taken for a LUT value in the first stage to reach F. There are now two pass-gates in series, and this delay is larger than for a 1LUT. Thus the pass-gates need to be wider to reduce the LUT delay. That increase in area and slow down in performance hurt LUT logic trees. Similarly, 3LUT, 4LUT and 5LUT constructions with 1LUTs are shown in FIG.-3D, FIG.-3E and FIG.-3F respectively. Those pass-gates have to be even wider to improve LUT delays. The 5LUT in FIG.-3F has 16 1LUTs in the first stage, 8 1LUTs in the second stage, 4 1LUTs in the third stage, 2 1LUTs in the fourth stage and one 1LUT in the final fifth stage. A total of 31 1LUTs are used in FIG.-3F for the 5LUT construction. A K-LUT cone or a K-LUT tree has K-input variables, K-stages and 2K LUT values to realize a K-input function. Each stage has one common input variable. 2(K-1) outputs from first stage feed as LUT values into second stage. Consecutive LUT value reduction continues until the last stage, when only 2 LUT values feed the last stage, and one LUT output is obtained. The equivalent 1LUTs required to build a K-LUT is tabulated in FIG.-3G, and is shown to grow as (2K-1). Logic porting to K-LUT is discussed by Ahmed et al. (Ref-3) for multiple K values. They have looked at porting 20 benchmark logic designs into varying LUT sizes: 1LUT, 2LUT, 3LUT, 4LUT, 5LUT, 6LUT and 7LUT. The geometric average number of K-LUTs required for porting 20 designs, as shown in FIG.-10 in Ref-2, is tabulated in the first 2 columns of FIG.-4. As can be seen, as the size of the K-LUT increases, the total number of K-LUTs required to fit an average design decreases. In addition, FIG.-4 also lists the equivalent 1LUT per K-LUT (from FIG.-3G) in column 3, and calculates the equivalent 1LUTs required for the design in column 4. Column 4 values are obtained by multiplying values in column 2 by values in column 3. In FIG.-4, each row represents how many K-LUTs are required for an average design, and an equivalent 1LUT calculation as a measure of Silicon utilization. 2LUT implementation in row-1 needs only 12900 1LUTs, while the 7LUT implementation in row-6 needs 177800 1LUTs for the same design. The latter 7LUT has only 7.3% Silicon utilization efficiency compared to the former 2LUT. From row-3, commercially available FPGAs with 4LUTs are seen only 36.1% efficient compared to 2LUTs at fitting logic. As the LUT size gets larger, clearly a more efficient LUT circuit is needed to improve Silicon utilization in LUT based logic elements.
LUT based logic elements are used in conjunction with programmable point to point connections. Four exemplary methods of programmable point to point connections, synonymous with programmable switches, between node A and node B are shown in FIG.-5. A configuration circuit to program the connection is not shown in FIG.-5. All the patents listed under FPGA architectures use one or more of these basic programmable connections. In FIG.-5A, a conductive fuse link 510 connects A to B. It is normally connected, and passage of a high current or exposure to a laser beam will blow the conductor open. In FIG.-5B, a capacitive anti-fuse element 520 disconnects A from B. It is normally open, and passage of a high current will pop the insulator shorting the two terminals. Fuse and anti-fuse are both one time programmable due to the non-reversible nature of the change. In FIG.-5C, a pass-gate device 530 connects A to B. The gate signal S0 determines the nature of the connection, on or off. This is a non destructive change. The gate signal is generated by manipulating logic signals, or by configuration circuits that include memory. The choice of memory varies from user to user. In FIG.-5D, a floating-pass-gate device 540 connects A to B. Control gate signal So couples a portion of that to floating gate. Electrons trapped in the floating gate determines an on or off state for the connection. Hot-electrons and Fowler-Nordheim tunneling are two mechanisms for injecting charge to floating-gates. When high quality insulators encapsulate the floating gate, trapped charge stays for over 10 years. These provide non-volatile memory. EPROM, EEPROM and Flash memory employ floating-gates and are non-volatile. Anti-fuse and SRAM based architectures are widely used in commercial FPGA's, while EPROM, EEPROM, anti-fuse and fuse links are widely used in commercial PLD's. Volatile SRAM memory needs no high programming voltages, is freely available in every logic process, is compatible with standard CMOS SRAM memory, lends to process and voltage scaling and has become the de-facto choice for modern day very large FPGA device construction.
All commercially available high density FPGA's use SRAM memory elements. A volatile six transistor SRAM based configuration circuit is shown in FIG.-6A. The SRAM memory element can be any one of 6-transistor, 5-transistor, full CMOS, R-load or TFT PMOS load based cells to name a few. Two inverters 603 and 604 connected back to back forms the memory element. This memory element is a latch providing complementary outputs S0 and S0′. The latch can be constructed as full CMOS, R-load, PMOS load or any other. Power and ground terminals for the inverters are not shown in FIG.-6A. Access NMOS transistors 601 and 602, and access wires GA, GB, BL and BS provide the means to configure the memory element. Applying zero and one on BL and BS respectively, and raising GA and GB high enables writing zero into device 601 and one into device 602. The output S0 delivers a logic one. Applying one and zero on BL and BS respectively, and raising GA and GB high enables writing one into device 601 and zero into device 602. The output S0 delivers a logic zero. The SRAM construction may allow applying only a zero signal at BL or BS to write data into the latch. The SRAM cell may have only one access transistor 601 or 602. The SRAM latch will hold the data state as long as power is on. When the power is turned off, the SRAM bit needs to be restored to its previous state from an outside permanent memory. In the literature for programmable logic, this second non-volatile memory is also called configuration memory. Upon power up, an external or an internal CPU loads the external configuration memory to internal configuration memory locations. All of FPGA functionality is controlled by the internal configuration memory. The SRAM configuration circuit in FIG.-6A controlling logic pass-gate is illustrated in FIG.-6B. Element 650 represents the configuration circuit. The S0 output directly driven by the memory element shown in FIG.-6A drives the pass-gate 610 gate electrode. In addition to S0 output and the memory cell, power, ground, data-in and write-enable signals in 650 constitutes the SRAM configuration circuit. Write enable circuitry includes GA, GB, BL, BS signals shown in FIG.-6A.
As discussed earlier, providing programmability is a very severe transistor and cost penalty compared to hard-wired Gate Array or ASIC implementation of identical logic. A significant factor in the penalty comes from the 6-transistors required for the configuration circuits. The natural conclusion is to minimize the number of configurable bits used in the programmable logic element. This mandates constructing a hard-wired larger 6LUT or a bigger LUT for next generation FPGAs. We have shown that Silicon utilization is severely impacted with this move towards larger LUT structures in logic elements. What is desirable is to have an economical and flexible LUT macro-cell, or a macro-LUT circuit. This LUT macro-cell should efficiently implement logic functions. Both large logic functions that port to one big LUT and small logic functions that port to multiple smaller LUTs should fit easily into a LUT macro-cell. Furthermore, LUT logic packing should maximize Silicon utilization to keep programmable logic cost reasonable with other hard-wired IC manufacturing choices. The user should be able to take a synthesized netlist from an ASIC flow, typically comprising smaller logic blocks, convert this netlist to fit in the FPGA granularity, place and route logic economically and efficiently. This would make use of existing third party ASIC tools at the front-end logic design and streamline tool flow for FPGA place & routing.
For an emulation device, the cost of programmability is not the primary concern if such a device provides a migration path to a lower cost. Today an FPGA migration to a Gate Array requires a new design to ensure timing closure. A desirable migration path is to keep the timing of the original FPGA design intact. That would avoid valuable re-engineering time, opportunity costs and time to solution (TTS). Such a conversion should occur in the same base die to avoid Silicon and system re-qualification costs and implementation delays. Such a conversion should also realize an end product that is competitive with an equivalent standard cell ASIC or a Gate Array product in cost and performance. Such an FPGA device will also target applications that are cost sensitive, have short life cycles and demand high volumes.