1. Field of the Invention
The invention relates to field programmable gate array circuits used in digital circuit design, and in particular, to an architectural design for their implementation on a molecular level.
2. Description of the Prior Art
This description outlines the prior practices in digital systems design and the construction of the gate array forms of integrated circuits.
A wide class of digital circuitry, referred to as clocked-mode (synchronous) digital systems, can be represented in a form shown in FIG. 1. Any clocked-mode system can be represented as a "block" of combinational logic and a linear array of registers. Combinational logic makes binary ("0" or "1") decisions based on Boolean functions of binary inputs and is the generalization of combinations of simple logic gates having one or two inputs (e.g., "AND", "OR", etc.) to a "block" of potentially many logic gates that inter-relate m inputs and n outputs. Even as a two-input "AND" gate relates a nearly trivial configuration of m=2 inputs and n=1 output, a generalized combinational logic block might have hundreds or more inputs and outputs.
The second part of a clocked-mode digital system, a linear array of registers, is equivalent in this representation to a set of data ("D") flip-flops with a single, common clock input. The data inputs to the array of registers come from each of the n outputs of the combinational logic block. The outputs of these registers are then either: (1) outputs (y(t)) of the digital circuitry block, or (2) inputs that are "fed" back (quantity of r signals) into the combinational logic block (h(t)). The registers synchronize the overall operation of a clocked-mode digital system by permitting a change in the outputs only in relation to a single clock. This feature of synchronization is important since it minimizes perturbations, especially in the inputs that are "fed" back into the combinational logic circuitry (eliminating the so-called "race" conditions). This approach to digital design is popular, predictable (easily modeled), and well-suited to automatic design methods, because when exploited properly, it creates circuitry with completely deterministic behavior.
In other digital systems representations and design approaches (referred to as asynchronous digital systems), the need for such synchronization is relaxed, albeit with considerably more involved design and analysis. The class of asynchronous systems can be shown to be the most general digital representation, and clocked-mode representations are a subset of asynchronous systems. In asynchronous systems, the lack of synchronizing/latching structures like registers and flip-flops makes designs much more complex, and sometimes the feedback paths that give rise to sequential behavior become sensitive to process and circuit layout particulars that may escape less careful analyses. Mismatches in timing of paths within a circuit can create the well-known hazard and "race" conditions where undesired transitions occur based on skewed information delivery to decision points within the circuitry. Most of the present invention is based on a synchronous model, but templates will also be discussed that can support a rich variety of asynchronous interactions as well.
The clocked-mode representation is the basis of most digital circuitry today, including finite state machines, micro-sequencers, central processing units, and many custom designs. As most of the automated synthesis of contemporary digital systems designs is based on clocked-mode representations and since many field programmable gate arrays are intended for this mode of operation, this restriction does not significantly limit the utility of the present invention. The key restriction to the application of clocked-mode circuitry is the existence of a single "synchronization domain", i.e., a single common clock controls the actions of the register array.
Very complex circuits contain multiple synchronization domains. It is not uncommon to divide very complex digital designs into synchronous and asynchronous sections, in which cases the synchronous content is usually dominant and lends itself to automatic design approaches. A full discussion of the ad hoc processes for multi-domain and asynchronous digital design is considerably involved and has only referential pertinence to the present invention. It is sufficient to indicate that the core concept of the present invention is based on approaches applicable to a single or a small number of synchronization domain(s) of a complex clocked-mode circuit. Asynchronous circuits can of course be more complex, since the synchronization domains may be ad hoc and in fact may be difficult to ascertain, which is among the many reasons why asynchronous design is more complex and less represented in automated design approaches.
Storage and feedback are necessary in digital systems in order to implement stable and history-dependent behavior in a circuit. Combinational circuitry acts on the immediate values of input variables, which when changed or removed, can create a change in the output function. Changes in an output of a block of combinational logic are therefore subject to variations in the inputs. They are also subject to delays in the time responsiveness of the combinational circuitry itself, a real world effect which is largely due to the speed of signal propagation in circuit elements and the sluggishness of the circuitry to sudden changes (due to, for example, capacitive effects). Since digital systems need to rely on stable information, it is important that a decision based on the output of a combinational block be made after all delay effects have subsided. For this reason, the use of a register array is important, because it represents a snapshot in time of what should be the correct output of the combinational circuitry. After the snapshot is taken, the inputs of the preceding combinational circuitry can change without affecting the registered output. Hence, "registered" can be thought of as "registration", in this case, registration to the edge of a pulsed clock signal. The highest speed at which a synchronous digital circuit can be operated is limited by the frequency of the clock pulses, and this frequency is limited in turn by the delay of the longest combinational circuit path. Careful management of the delay effects and knowing when new inputs can be provided to the circuitry and when the clock can be "advanced" are the hallmarks of the present art of high-performance digital design. The registers in clocked-mode circuitry clearly facilitate the stability necessary for achieving this performance. Furthermore, when registers can provide feedback to the combinational network, they permit history-dependent behavior.
The registers that are fed back into the combinational circuitry block can be said to encode state information.
Since combinational circuitry generates output(s) based on a Boolean function of one or more (or all) inputs, it can generate, among other things, the value of the next state. Decoupled by the synchronization structure provided through the register array, this state is latched in by the clock to become the "new" (next) state. Finite state machine (FSM) behavior is strictly a manifestation of the existence of state information, hence the use of feedback is necessary for implementing complex digital systems. Generation of both outputs and states is accomplished through the combinational circuitry, and the snapshot of the current, correct outputs and states is accomplished through the register array.
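The clocked-mode model described above can be illustrated with a minimal Python sketch. The names run_clocked and parity_logic are hypothetical and do not appear in the figures; the sketch assumes a single clock domain and treats each loop iteration as one clock edge at which the register array takes its snapshot.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (input x, fed-back state h) -> (output y, next state)
CombLogic = Callable[[int, int], Tuple[int, int]]

def run_clocked(comb_logic: CombLogic, inputs: List[int], initial_state: int = 0) -> List[int]:
    """Simulate a single-clock-domain system: combinational block plus registers."""
    state = initial_state
    registered_outputs = []
    for x in inputs:
        # The combinational block settles on the current input and state.
        y, next_state = comb_logic(x, state)
        # Clock edge: the register array latches the output and the next state.
        registered_outputs.append(y)
        state = next_state
    return registered_outputs

def parity_logic(x: int, state: int) -> Tuple[int, int]:
    """Example block: the state tracks the parity of all inputs seen so far."""
    next_state = state ^ x
    return next_state, next_state

trace = run_clocked(parity_logic, [1, 0, 1, 1])
```

The output at each clock depends on earlier inputs only through the fed-back state, which is the history-dependent behavior the register array makes possible.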
It is important to observe the two extremes under which the combinational part of the synchronous digital system can be implemented. In the first extreme case, a combinational circuit can be represented as a very large look-up table (LUT). Since a combinational circuit with m inputs can be completely specified by a truth table of 2.sup.m entries, it is conceptually simple to consider a circuit where all of these entries are contained in an electronic LUT, which is equivalent to a brute force electronic implementation of a truth table. A simple example based on a two-input AND gate is shown in FIG. 2. In FIG. 2(a), the symbol of the circuit is shown. In FIG. 2(b), the truth table which enumerates each of the 2.sup.2 = 4 combinational possibilities is shown. In FIG. 2(c), a brute force look-up table (LUT) is implemented with a decoder and a matrix. The decoder is simply a circuit in which exactly one output is active ("high" or logical state "1") for each possible combination of the inputs, so that the number of decoder outputs is identical to the number of truth table entries (2.sup.m). The matrix implements the truth table using a brute force approach. The column wire represents the output of the look-up table. Here, the presence of a diode between a row and the single column is equivalent to having a logical "1" for the corresponding entry in the truth table. For the AND gate, of course, this condition only occurs when both inputs are high. The diodes are used instead of ordinary wires to prevent shorting through reverse paths, which becomes important for the case where more than one column exists.
It is a simple matter to extend the number of outputs by adding columns. For example, to implement the two-function circuit in FIG. 3(a), which is an AND gate in parallel with an OR gate for the same inputs a and b, the circuit in FIG. 2(c) is expanded by adding another column. Of course, this in effect implements both truth tables of FIG. 3(b), resulting in the final circuit in FIG. 3(c).
Of course, these illustrative examples are not practical, primarily because the decoding circuitry itself is more complicated than the simplistic target example being shown. It is clear, however, as the approach is extended to more complex examples, that the decoding overhead becomes a less significant fraction of the circuit in question. The primary objective of this discussion is to provide a framework for discussing one extreme in implementing a combinational logic circuit with m inputs and n outputs. In summary, the look-up tables in this context are shown to consist of a decoder circuit for m inputs, and a matrix of 2.sup.m rows by n columns, with a diode wherever a logical "1" is present (for the truth table entry corresponding to that particular row and the output corresponding to that particular column) and no diode where a logical "0" is present.
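The decoder-and-matrix arrangement can be sketched in Python as follows. This is an illustrative model only: decode, lut_outputs, and AND_OR_MATRIX are hypothetical names, and the row ordering (inputs a, b read as a binary number) is an assumption rather than something taken from FIG. 3(c).

```python
def decode(inputs):
    """One-hot decoder: m input bits drive 2**m row lines, exactly one high."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return [1 if row == index else 0 for row in range(2 ** len(inputs))]

# diode_matrix[row][col] == 1 models a diode tying that row to that column.
# Assumed layout: column 0 = AND, column 1 = OR; rows are (a, b) = 00, 01, 10, 11.
AND_OR_MATRIX = [
    [0, 0],
    [0, 1],
    [0, 1],
    [1, 1],
]

def lut_outputs(inputs, diode_matrix):
    """A column goes high when the single active row has a diode on it."""
    rows = decode(inputs)
    n_cols = len(diode_matrix[0])
    return [
        int(any(rows[r] and diode_matrix[r][c] for r in range(len(rows))))
        for c in range(n_cols)
    ]
```

Adding an output is a matter of appending one more column to the matrix; the decoder is unchanged.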
Before examining the other extreme in combinational logic implementation, it is important to explore the matrix of the look-up table just described. Clearly, this look-up table represents information content, corresponding to a pattern of ones and zeros in a number of truth tables (one for each output). In contemporary design, it is possible and normal to use a memory device to implement such information. Hence, a memory device can implement a look-up table. In FIG. 4, two examples of a memory are shown in look-up table applications. In FIG. 4(a), a very simple 16-bit memory (another illustrative but impractical device example) is shown, which has four inputs and a single output. For these examples, the control signals are omitted for clarity. The four-input memory is identical to a four-input look-up table, which can notionally be any conceivable Boolean function of four inputs. In one example (FIG. 4(b)), a four-input OR gate is represented. A second example using the 16-bit memory, shown in FIG. 4(c), implements a more complex function, equivalent to several individual combinational logic gates. Considerably more complex functions are obviously possible, given the very dense memories available in contemporary integrated circuit design. For example, a semiconductor memory shown in FIG. 4(d), simple by modern standards, contains one megabit (2.sup.20) of storage, which in this form can implement a combinational network of as many as 17 inputs and 8 outputs. This memory can (by definition) implement any eight independent truth tables or Boolean functions (one for each of the eight outputs) of the same 17 input variables.
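As a sketch of the memory-as-LUT idea, the following Python fragment fills a 16-entry, one-bit-wide memory with the truth table of a four-input OR and then reads it back by address. The helper names build_memory and read are hypothetical, and the address bit ordering is an assumption.

```python
def build_memory(func, m):
    """Fill a 2**m-entry, one-bit-wide memory with the truth table of func."""
    memory = []
    for addr in range(2 ** m):
        bits = [(addr >> (m - 1 - i)) & 1 for i in range(m)]  # MSB first
        memory.append(func(*bits))
    return memory

def read(memory, *inputs):
    """Treat the input bits as an address and return the stored bit."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return memory[addr]

# A 16-bit memory programmed as a four-input OR gate.
or4 = build_memory(lambda a, b, c, d: a | b | c | d, 4)

# Size check for the one-megabit example: 17 address inputs, 8 data outputs.
megabit_bits = (2 ** 17) * 8
```

Reprogramming the same 16 bits yields any other Boolean function of four inputs, which is the property the text relies on.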
It is clear that every technology that has been used to implement a memory can be used to implement a look-up table. Common classes of such semiconductor memory include permanent, read-only memory (ROM), programmable (usually fuse-link based) read-only memory (PROM), erasable (otherwise permanent) read-only memory (EPROM, UVPROM, EEPROM, "flash" ROM, etc.), and random-access memory (RAM). ROM and PROM memories are considered "one-time programmable", useful for fixed implementations, but generally unalterable. RAM-based approaches, on the other hand, can be altered at will (i.e., they are reconfigurable), provided that circuitry for re-configuration is built into the design. These two tenets--re-configurability and the means of re-configuration--are the cornerstones of field programmable gate arrays (FPGA), and these concepts are as central to the present invention as they are to nearly every other existing FPGA.
Although a look-up table (LUT) implemented as memory is a clear and powerful technique for implementing a "block" of combinational logic, the more direct approach is simply to use individual logic gates, wired together as necessary, to form general combinational logic networks. Hence, rather than implement the circuit in FIG. 4(c) as a 16-bit memory, the seven discrete logic elements (AND, OR, and inverter or "NOT" gates) would be used. Why use this method, given the conceptual simplicity of memory devices? Several reasons exist, but the simplest justification is derived from the example of a 100-input AND device. The device is conceptually simple, as it is merely a combinational logic circuit with 100 inputs and a single output. The device outputs a logical "1" only when all inputs are equal to logical "1"; otherwise, the output is logical "0". Implementing this device as a memory is intractable by present standards, as it would require a memory containing 2.sup.100 bits! As a direct implementation, however, it is simple. Even built from elemental 2-input AND gates, this function would require only N-1 = 99 gates for N = 100 inputs, arranged as a tree roughly log.sub.2 (N) (about seven) levels deep. It is clear that the number of logic gates required to implement a Boolean function varies with the complexity of the function. The number of elemental (e.g., 2- or 3-input) gates needed can grow exponentially with the number of inputs in the very worst case. However, in the vast majority of cases, logic functions for commonly used designs have much less complexity on average than the very worst case. Unfortunately, using a memory to implement logic always results in an exponential amount of circuitry, regardless of what function is implemented, whether simple or complex.
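The contrast between the two extremes can be checked numerically. The sketch below, with the hypothetical helper and_tree_gate_count, counts the 2-input AND gates in a pairwise reduction tree and compares that count with the number of bits a memory-only implementation would need; it assumes every gate has exactly two inputs and that an odd signal is simply passed up to the next level.

```python
def and_tree_gate_count(n_inputs):
    """Count 2-input AND gates (and tree levels) needed to combine n_inputs signals."""
    gates, levels, signals = 0, 0, n_inputs
    while signals > 1:
        pairs = signals // 2               # each pair feeds one 2-input gate
        gates += pairs
        signals = pairs + (signals % 2)    # an odd signal passes through unchanged
        levels += 1
    return gates, levels

gates_100, levels_100 = and_tree_gate_count(100)
memory_bits_100 = 2 ** 100                 # truth table size for 100 inputs
```

The gate count grows linearly with the number of inputs for this function, while the memory size doubles with every added input.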
A compromise between implementing a combinational logic block as an impossibly large but infinitely flexible memory device (serving as a massive LUT) and a large array of direct but inflexible logic devices is the fertile ground from which the field programmable gate array (FPGA) device field has been born. If this argument has a "punch line", it is that FPGA devices employ many elemental LUTs in an interconnection matrix of wires that can be re-routed to some degree by software. The approach is summarized in FIG. 5. It seems that "total control" is possible, since both the behavior and connections are controllable. The behavior is controlled by establishing a desired pattern of ones and zeros in the various LUTs, and the connections are controlled by exploiting whatever reconfiguration potential exists in the routing manifold. LUT-based approaches bear attributes in common with both memory-based and gate-based implementation schemes. Since LUTs are in effect a memory, they offer the flexibility of memory. The difference is that in order to implement functions with large numbers of inputs, several LUTs are used instead of one massive memory. If, for example, a 12-input function could be implemented with four 3-input LUTs, the total number of memory bits is 4*2.sup.3 = 32. On the other hand, with a memory-only approach, a similar implementation necessarily requires 2.sup.12 = 4,096 memory bits (128 times as many, or the equivalent of 512 3-input LUTs). The LUT-based approach achieves dramatic economy in storage requirements over brute-force memory approaches by limiting growth of bits in any single memory (LUT), and then relying on having the ability to apply many such LUTs to implement Boolean functions. In this respect, LUT-based approaches resemble approaches based on using elemental gates. The difference, of course, is that LUTs are capable of implementing any gate with the same number of inputs.
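The storage arithmetic in this comparison can be reproduced with a few lines of Python. The function name lut_bits is hypothetical, and the sketch assumes, as the text does, that the 12-input function happens to decompose onto four 3-input LUTs.

```python
def lut_bits(n_inputs):
    """Number of memory bits in a single LUT with n_inputs address inputs."""
    return 2 ** n_inputs

partitioned_bits = 4 * lut_bits(3)    # four 3-input LUTs
monolithic_bits = lut_bits(12)        # one 12-input memory
ratio = monolithic_bits // partitioned_bits
```

The ratio grows rapidly as the input count rises, which is why capping the size of any single LUT is so effective when the target function decomposes well.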
There are, however, essentially two compromises in effect: (1) the fine mesh of LUTs can implement any function of only a small number of inputs and may not be able to implement all conceivable Boolean functions of larger numbers of inputs, and (2) the routing interconnection network must necessarily contain compromises that restrict some routing possibilities. The end product then is a sort of emulation of a direct approach in combinational logic using some granularization (no super-large LUTs) into a number of elemental LUTs with a manifold of interconnections. For RAM-based FPGAs, of course, both the contents of the LUTs and the switch patterns of the routing network are user re-configurable.
Contemporary field programmable gate arrays (FPGAs) are built as monolithic integrated circuits (ICs), usually involving silicon semiconductor technology. The technology of semiconductor fabrication involves a variety of processes that can be divided into high temperature (>600 degrees Celsius) and low temperature (<600 degrees Celsius). The high temperature processes include diffusions and oxidations, while low temperature processes include the metallization (wiring between transistors). Low temperature metallizations must be done after all high-temperature processing is completed. Performing an IC fabrication involves the serial processing of a group of wafers (lot) through many high and low temperature processes.
In the early days of digital ICs, all designs were done based on completely customized layouts of all IC features, including transistors and interconnects. It was later learned that the high temperature steps could be done generically for a large group of wafers, which could be stockpiled. When orders for specific IC devices arrived, these semi-fabricated wafers could be completed by finishing only the last few low-temperature steps, dramatically improving the pace at which customer orders for ICs could be filled. By changing only the last steps of nearly fabricated wafers, it was found that large classes of digital designs could be created by establishing a dense planar grid of transistor diffusions on the surface of a wafer and stockpiling them with undefined metal layers. To form finalized integrated circuits, it was necessary to specify the metal interconnections between the pre-fabricated diffusions/transistors (through layout). Since the time-consuming and complex task of the transistor fabrication was already done, the simpler design and fabrication steps involving metal interconnections could be done in a very short time. Such devices, referred to as gate arrays, are used to implement complex digital integrated circuits quickly by personalizing the metal interconnections on silicon wafers that contain a large pre-fabricated array of transistors. "Personalization" is an act of design that allows the wafer to be mask-customized for specific functions through the process of integrated circuit layout, which involves a variety of patterning steps through which intentional designs are conveyed during fabrication.
Field programmable gate arrays (FPGAs) carry the analogy of speedy customization of partially fabricated gate arrays one step further. In particular, FPGAs defer the functional specification of their internal circuit configuration until after the chip is built, at which point it is supplied through software. In this case, a designer personalizes (using software only) the configuration of IC "chips" that are completely pre-fabricated and sometimes in the user's own inventory, dramatically reducing the time to achieve a specialized IC as the delay of fabrication is completely eliminated. FPGAs rely on a large number of special pre-fabricated circuit structures that can be configured and connected under software control to form functions that are in many cases equivalent to those that would otherwise be built with "semi"-prefabricated gate arrays or fully customized designs.
FPGAs can be viewed as devices that predominantly contain large numbers of logic and routing "resources." They can in some sense be viewed as a pool of building blocks that can be configured and connected at will into more complex circuits. Physically, this is not done by adding material (e.g., wires) to the device but by setting and clearing bit patterns in what is referred to as a device configuration memory. The bits of the configuration memory are not signals processed during the device's operation, but rather correspond to the specification of a behavior pattern (for logic resources) or the bridging/separation of wiring paths between various points within the device. Configuration memory specifies the operation of the device, just as software specifies the operation of a computer. But whereas a computer is based on blocks of logic and wiring that are fixed, the FPGA creates in effect the appearance of a moldable block of logic and resources. The bridges in physical reality are always present or have the potential of being present nearly everywhere in the device. The design process then for FPGAs is reduced to specifying a particular subset of the potential connections and behaviors necessary to effect a desired overall circuit. This circuit for almost all intents and purposes performs in a manner indistinguishable from a circuit made in a more traditional way (gate array or full custom).
The patterns impressed into FPGAs to form circuits can be reversible or permanent, depending on the underlying process used to fabricate the original device. The reversible FPGAs are said to be re-configurable, whereas the permanent FPGAs are said to be "one-time programmable." Since this invention is concerned only with reversible patterns, only those FPGAs will be discussed further here.
In reversible or re-configurable FPGAs, the configuration pattern that defines device behavior is usually transmitted electrically into the device upon the initial application of power. In RAM-based FPGAs, the configuration of the FPGA is persistent only for as long as power is applied to the device. Once power is interrupted, the pattern is lost and must be re-established. In practice, this need to refresh can be handled in several standard ways, and in fact is often desirable as a feature. Such RAM-based FPGAs can be updated even after a system containing them has been placed in service. In RAM-based FPGAs, logic and routing resources comprise the essential building blocks from which general-purpose digital systems can be made.
The most important concept for implementing logic resources in the FPGA is the look-up table (LUT). A LUT can be viewed as a Boolean function generator of m Boolean input variables. Since m inputs can take on 2.sup.m possible combinations of values, it is relatively straightforward to form an m-input LUT with a 2.sup.m -bit memory that has a one-bit wide data path and m address bits. Two equivalent implementations of a 3-input look-up table (3-LUT) are shown in FIG. 6. In the figure, A is the symbol of the 3-LUT; B is the K-map representation; C is the implementation using N-pass transistors; and D is the implementation using 2-input logic gates.
LUTs are capable of implementing all 2.sup.(2.sup.m) possible functions of m variables, and in this sense, the m-LUTs (LUTs with m input variables) are completely capable of simulating/implementing any m-input function.
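The count of functions an m-LUT can realize follows from enumerating every possible content pattern of its memory. A short Python sketch, using the hypothetical names lut_eval and all_lut_patterns and an assumed most-significant-bit-first address order, makes this concrete.

```python
from itertools import product

def lut_eval(bits, inputs):
    """Read an m-LUT: the inputs form the address into its stored bits."""
    addr = 0
    for b in inputs:
        addr = (addr << 1) | b
    return bits[addr]

def all_lut_patterns(m):
    """Every possible content pattern of an m-LUT, one per Boolean function."""
    return list(product((0, 1), repeat=2 ** m))

patterns_2 = all_lut_patterns(2)
xor_bits = (0, 1, 1, 0)   # contents that make a 2-LUT behave as XOR
```

Each distinct pattern is a distinct truth table, so a 2-LUT covers all sixteen functions of two variables and a 3-LUT all 256 functions of three.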
Since general digital systems must be capable of history-dependent behavior, it is important to implement memory in designs. For many LUT-based designs, a single memory bit is included at the output of each LUT. When implemented as shown in FIG. 7, a memory feature can be optionally incorporated in a LUT. A multiplexer (data selector) allows the use/bypass of a single memory bit (shown in the form of a "data" or "D" flip-flop) when the data selector is set to a "1" or "0" by the state of a single configuration memory bit. When selected, the output of the LUT is registered in synchronization with the clock signal. This establishes the basis for state machine behavior. In this case, the contents of memory comprise the state. When the flip-flop is bypassed, the output of the LUT is passed directly to the output.
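The optional register at the LUT output can be modeled as below. This is a behavioral sketch only; the class LogicCell and its attribute use_register are hypothetical names standing in for the multiplexer's configuration bit, and the polarity of that bit is an assumption.

```python
class LogicCell:
    """A LUT followed by a D flip-flop that a configuration bit can bypass."""

    def __init__(self, lut_bits, use_register):
        self.lut_bits = list(lut_bits)
        self.use_register = use_register  # configuration bit driving the multiplexer
        self.q = 0                        # contents of the D flip-flop

    def _lut(self, inputs):
        addr = 0
        for b in inputs:
            addr = (addr << 1) | b
        return self.lut_bits[addr]

    def clock(self, inputs):
        """Clock edge: the flip-flop captures the current LUT output."""
        self.q = self._lut(inputs)

    def output(self, inputs):
        """Multiplexer: registered value if selected, else the raw LUT output."""
        return self.q if self.use_register else self._lut(inputs)

registered = LogicCell([0, 1, 1, 0], use_register=True)
before = registered.output([1, 0])   # flip-flop not yet clocked
registered.clock([1, 0])
after = registered.output([0, 0])    # inputs changed, registered value holds
combinational = LogicCell([0, 1, 1, 0], use_register=False).output([1, 0])
```

With the register selected, the output follows the inputs only at clock edges; with it bypassed, the cell is purely combinational.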
FPGAs also contain routing resources, which are usually in the form of wires and transistor-based switches between those wires. A piece of the routing fabric that might be contained in a typical FPGA is shown in FIG. 8. In this example, 12 wires (labeled "a" through "l") and 20 switches (not labeled) are shown. The circle represents a switch as shown in FIG. 8(b). The switch is controlled by a single bit of configuration memory, which shorts together the wires when set to "1", and otherwise leaves the wires open-connected. Symbolically, an open circle represents an open switch and a filled circle a closed switch. For illustrative purposes, FIG. 8(c) depicts a battery at terminal "a" connected to a light bulb at terminal "f". As shown in FIG. 8(d), in order to turn on the light, at least two switches must be closed.
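A routing fabric of this kind can be modeled as a graph in which each closed switch joins two wires. The sketch below uses a hypothetical topology (the actual switch placement of FIG. 8 is not reproduced here) and a hypothetical function connected that searches for a path through closed switches only.

```python
def connected(config, start, goal):
    """Return True if closed switches form a path between two wires.

    config maps a switch, given as a pair of wire labels, to its configuration bit.
    """
    adjacency = {}
    for (u, v), bit in config.items():
        if bit:  # only a switch whose bit is 1 shorts its two wires together
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    seen, frontier = {start}, [start]
    while frontier:
        wire = frontier.pop()
        if wire == goal:
            return True
        for nxt in adjacency.get(wire, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# Assumed topology: wire "a" reaches wire "f" only by way of wire "e".
config = {("a", "e"): 0, ("e", "f"): 0, ("a", "b"): 0}
lamp_off = connected(config, "a", "f")
config[("a", "e")] = 1
one_switch = connected(config, "a", "f")
config[("e", "f")] = 1
lamp_on = connected(config, "a", "f")
```

In this assumed layout the lamp lights only after both switches on the path are closed, mirroring the two-switch requirement described for FIG. 8(d).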
It is true today, and will likely remain true, that FPGAs by their very nature will be inferior in performance to devices made using fixed-wiring integrated circuits, such as standard cell gate arrays or full custom designs. Here, performance refers to speed and density. This is due to the fact that FPGAs implement much more routing circuitry than an equivalent gate array or full custom IC (which connects metal wires only between intended points on a specific design). Correspondingly, FPGAs are much more sluggish due to excess parasitic capacitance associated with the additional interconnect needed to guarantee the possibility of general connection to many points across possible designs. Another factor in the sluggishness of FPGAs is the extra amount of silicon required to permit reconfiguration. According to DeHon, an FPGA may take 100.times. the silicon area to implement the equivalent function in a fixed-wire, fixed-logic approach (i.e., the standard cell gate array or full custom IC). [DeHon, Andre. "Reconfigurable Architectures for General-Purpose Computing". Massachusetts Institute of Technology, A.I. Technical Report No. 1586, October 1996]. As such, there is additional propagation delay due to time of flight over a longer distance. Since both the additional capacitance and time of flight are delay factors, FPGAs are generally incapable of the maximum possible performance in any given silicon technology.
For reasons of performance, modern silicon FPGAs employ structures that at their essence are the same LUTs and routing resources previously described, but far more elaborate. In fact, 80-90% of the silicon area of typical FPGAs is devoted to interconnection-related resources. The embellishments attempt to enhance the performance of FPGAs and improve their generality in application to the widest cross section of digital designs that are popular at the time of introduction. In the research of considerably advanced electronics, in which the critical device dimensions are on a nanometer scale, many problems fundamental to device engineering exist. At these dimensions, no effective lithographic techniques exist. Furthermore, the supply of interconnections appears to be very constrained. While these problems exist even at contemporary device scales (180-250 nm), it is still practical to assemble circuits where an individual node might have hundreds of connections. At the nanometer scale, particularly for molecular electronics, it appears possible to converge only a very limited number of electronic connections at a single physical location (e.g., 2-6). Finally, it is likely that many random defects will occur in electronics fabricated at a molecular scale. Only structures that have high regularity are likely to be able to recover functionality in the presence of such defects. As such, many ordinary architectures designed in silicon, where defect densities are controllable, will be unsuitable for application at a nano-scale, due to interconnection demand.