1. Field of the Invention
The invention pertains to the field of integrated circuits. More particularly, the invention pertains to routing architectures for use in programmable logic based integrated circuit devices.
2. The Prior Art
Programmable logic devices such as Field Programmable Gate Array (FPGA) integrated circuit devices are known in the art. An FPGA comprises any number of initially uncommitted logic modules arranged in an array along with an appropriate amount of initially uncommitted routing resources. Logic modules are circuits which can be configured to perform a variety of logic functions like, for example, AND-gates, OR-gates, NAND-gates, NOR-gates, XOR-gates, XNOR-gates, inverters, multiplexers, adders, latches, and flip/flops. Routing resources can include a mix of components such as wires, switches, multiplexers, and buffers. Logic modules, routing resources, and other features like, for example, I/O buffers and memory blocks, are the programmable elements of the FPGA.
The programmable elements have associated control elements (sometimes known as programming bits or configuration bits) which determine their functionality. The control elements may be thought of as binary bits having values such as on/off, conductive/non-conductive, true/false, or logic-1/logic-0 depending on the context. The control elements vary according to the technology employed and their mode of data storage may be either volatile or non-volatile. Volatile control elements, such as SRAM bits, lose their programming data when the PLD power supply is disconnected, disabled or turned off. Non-volatile control elements, such as antifuses and floating gate transistors, do not lose their programming data when the PLD power supply is removed. Some control elements, such as antifuses, can be programmed only one time and cannot be erased. Other control elements, such as SRAM bits and floating gate transistors, can have their programming data erased and may be reprogrammed many times. The detailed circuit implementation of the logic modules and routing resources can vary greatly and is appropriate for the type of control element used.
Typically a user creates a logic design inside manufacturer-supplied design software. The design software then takes the completed design and converts it into the appropriate mix of configured logic modules and other programmable elements, maps them into physical locations inside the FPGA, configures the interconnect to route the signals from one logic module to another, and generates the data structure necessary to assign values to the various control elements inside the FPGA.
Many FPGA architectures employing various different logic modules and interconnect arrangements are known in the art. Some architectures are flat while others are clustered. In a flat architecture, the logic modules may or may not be grouped together with other logic modules, but all of the logic modules have similar or nearly equivalent access to the larger routing architecture. In a clustered architecture, the logic modules are grouped together into clusters, meaning that all of the logic modules in the cluster have some degree of exclusive routing interrelationship between them relative to logic modules in other clusters.
FIG. 1 illustrates a block diagram of a prior art logic cluster 100 which illustrates some basic principles of a clustered architecture. Internal cluster routing network 101 routes signals input to the cluster 100 from External Horizontal & Vertical Routing (EHVR) 114 and from logic cell output feedback lines to the inputs of logic cells 102. The internal cluster routing network 101 includes Cluster Internal Routing (CIR) 110, Cluster Input Multiplexers, an exemplary one being labeled 112, and Logic Function Generator Input Multiplexers (an exemplary instance being labeled 108). The box designated Cluster Internal Routing (CIR) 110 contains the cluster internal interconnects and the box designated External Horizontal & Vertical Routing (EHVR) 114 contains the external interconnects and other routing resources of the larger FPGA (not shown).
The illustrative logic cluster of FIG. 1 contains four logic cells (functional blocks), an exemplary instance being labeled 102, though any number can be present. A logic cell 102 may comprise a logic function generator circuit (or LFG) 104 and an associated sequential element 106 (designated SE in the diagram), typically a flip/flop that can also be configured to be a latch. The logic cells in the diagram have four logic input interconnects, since in practice a 4-input look-up table (LUT) is the most common function generator in this sort of architecture. The output interconnect of each LFG is coupled to the data input interconnect of the associated sequential element. The output of each logic function generator and each sequential element is coupled to a functional block output interconnect. The output interconnect coupled to the function generator is a combinational output interconnect while the output interconnect coupled to the sequential element is a sequential output interconnect.
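The behavior of the 4-input LUT mentioned above can be sketched in Python (an illustrative model only, not part of the disclosed circuits; the function names are hypothetical). Sixteen configuration bits define the function, and the four inputs form an index into that table, so any Boolean function of four variables can be programmed.

```python
# Minimal sketch of a 4-input look-up table (LUT) as 16 configuration bits.
# The four inputs form a 4-bit index selecting one configuration bit.

def make_lut4(config_bits):
    """config_bits: sequence of 16 ints (0/1), one per input combination."""
    assert len(config_bits) == 16
    def lut(a, b, c, d):
        # Input-to-index ordering is an arbitrary modeling choice here.
        index = (d << 3) | (c << 2) | (b << 1) | a
        return config_bits[index]
    return lut

# Program the LUT as a 4-input AND gate: only index 15 (all inputs high) is 1.
and4 = make_lut4([0] * 15 + [1])
```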
Typically there are other features present in logic cell 102 that are not shown in FIG. 1. For example, there may be a way to bypass the LFG and connect from the routing resources to the data input interconnect of the sequential element, or specialized circuitry to make implementing arithmetic functions more efficient. Typically the sequential element is more complicated than the simple D-flip/flop shown in FIG. 1. Often there will be, for example, a set signal, a reset signal, an enable signal, a load signal, and a clock signal (shown but without its source) present. Collectively these signals are called sequential control input interconnects and they typically have their own associated routing resources that may or may not be coupled to the cluster routing resources shown in FIG. 1.
In FIG. 1, CIR 110 contains only routing wires, while EHVR 114 contains a variety of different elements like switches, multiplexers, and buffers in addition to routing wires. Functional block output interconnects are cluster output interconnects if they connect to the EHVR 114. If a functional block output interconnect connects to the CIR 110, it is called a feedback interconnect since it allows cluster outputs to feed back to inputs of LFGs in the same cluster without leaving and reentering the cluster by means of cluster outputs and external interconnect and routing resources. A functional block output interconnect can be both a cluster output interconnect and a feedback interconnect if it connects to both the CIR and EHVR.
In FIG. 1, the Logic Function Generator Input Multiplexers (an exemplary instance being labeled 108) are coupled between the Cluster Internal Routing block and the various logic input interconnects on the functional blocks 102. Since there are four functional blocks each with four input interconnects, there are a total of sixteen Logic Function Generator Input Multiplexers in the exemplary cluster 100. Typically, the number of input interconnects on each Logic Function Generator Input Multiplexer is less than the total number of lines in the Cluster Internal Routing Lines block, so each Logic Function Generator Input Multiplexer can only transmit a subset of the signals inside CIR 110 to its associated LFG input.
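The subset property just described can be sketched in Python (an illustrative model, with hypothetical names, not part of the disclosure): each input multiplexer taps only some of the CIR lines, so a signal is routable to a given LFG input only if the line carrying it belongs to that multiplexer's subset.

```python
# Sketch: a Logic Function Generator input multiplexer that can select from
# only a subset of the Cluster Internal Routing (CIR) lines.

class InputMux:
    def __init__(self, reachable_cir_lines):
        self.reachable = set(reachable_cir_lines)
        self.selected = None  # configuration state: which CIR line is selected

    def can_route(self, cir_line):
        # A signal on a CIR line outside the subset cannot reach this input.
        return cir_line in self.reachable

    def select(self, cir_line):
        if not self.can_route(cir_line):
            raise ValueError("CIR line not reachable from this multiplexer")
        self.selected = cir_line

# A mux that taps only 8 of a cluster's 20 CIR lines:
mux = InputMux(reachable_cir_lines=[0, 2, 5, 7, 11, 13, 17, 19])
```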
In the architecture 100 of FIG. 1, signals are transmitted from the EHVR to the CIR by ten Cluster Input Multiplexers, an exemplary one being labeled 112. Various interconnects and resources from other parts of the FPGA are connected to the inputs of the Cluster Input Multiplexers by means of the External Horizontal & Vertical Routing 114. The lines internal to the Cluster Internal Routing box 110 come from a variety of sources: the outputs of the Cluster Input Multiplexers, the outputs of the cluster's LFGs and SEs, and possibly other sources such as clock networks and other special functions not shown in FIG. 1 to avoid overcomplicating the diagram.
As FPGAs get larger, clustered architectures are favored over completely flat ones because they ease the place-and-route task and allow the design software to complete it faster. There are many examples of clustered architectures in both the academic literature as well as in commercial products.
FIG. 2 shows an exemplary cluster 200 of a type known in the art employing busses in a portion of the internal cluster routing. Present in FIG. 2 are functional blocks 202, Level 1 multiplexers 204a through 204j, EHVR 206, cluster input interconnect busses 208a through 208j, functional block output bus 210, and feedback interconnects 212. This is an abstraction intended to focus attention on the relationships between classes of interconnects inside cluster 200 rather than on the detailed connections of a specific circuit topology.
The external horizontal and vertical routing (EHVR) 206 contains routing interconnects and other routing resources such as, for example, multiplexers, buffers, and control elements for programming and enabling them. Placing the balance of the FPGA routing in box 206 is a deliberate abstraction to allow focusing on the relationships of classes of interconnects inside cluster 200.
The level 1 multiplexers 204a through 204j are coupled to EHVR 206 by cluster input interconnect busses 208a through 208j. While interconnect busses 208a through 208j couple EHVR 206 to the level 1 multiplexers 204a through 204j, they do not connect to the feedback interconnects 212. In FIG. 2, they can be thought of as “passing under” them instead. This convention will be used with respect to various interconnect representations throughout this application, since drawing such busses in the more conventional “passing over” style makes the drawing figures harder to read and obscures the concepts being illustrated.
Examples of clusters such as shown in FIG. 2 are found in a number of different commercial FPGA families offered by Xilinx, Inc., of San Jose, Calif. Another cluster 220 of the prior art is shown in FIG. 3. Present in the drawing figure are functional blocks 222, level 1 multiplexers 224a through 224j, local interconnects 226, level 2 multiplexers 228a through 228j, EHVR 230, interconnect busses 232a through 232j, interconnect busses 234a through 234j (though only 234j is labeled in the figure), functional block output bus 236, and feedback interconnects 238.
FIG. 4 shows another bus-based FPGA cluster 250 of the prior art. Present in the drawing figure are functional blocks 252, level 1 multiplexers 254a through 254j, level 1 interconnects 256, level 2 multiplexers 258a through 258j, level 2 interconnects 260, level 3 multiplexers 262a through 262j, input busses 264a through 264j, EHVR 266, functional block output bus 268, and feedback interconnects 270. Similar to the clusters shown in FIG. 2 and FIG. 3, the numbers of functional blocks, the numbers of first, second and third level multiplexers, the numbers of interconnects in the various interconnect busses, and the number of input channels on the various multiplexers are all a matter of design choice.
The data flow for external signals is through interconnects originating in EHVR 266 that are coupled to some of the inputs of the third level multiplexers 262a through 262j. The outputs of the level 3 multiplexers are coupled to the level 2 interconnections 260 which in turn are coupled to the inputs on the level 2 multiplexers 258a through 258j. The outputs of the level 2 multiplexers 258a through 258j are coupled to the level 1 interconnects 256 which are coupled to the inputs of the level 1 multiplexers 254a through 254j, which in turn have their outputs coupled to the inputs of the functional blocks 252. Thus the cluster inputs enter the internal cluster routing resources at the level 3 multiplexers.
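The three-level data flow just described can be sketched in Python (an illustrative model with hypothetical names, not the patented circuit): each multiplexer level is a mapping from a multiplexer to the set of source lines it can select, and an external signal is routable to a functional block input only if a chain of selections exists through levels 3, 2, and 1 in order.

```python
# Sketch of routing through the three multiplexer levels of FIG. 4.
# Each level maps a mux name -> the set of source lines it can select;
# a mux's output line is named after the mux itself.

def route_path(signal_line, level3, level2, level1):
    """Return a (level-3 mux, level-2 mux, level-1 mux) choice that routes
    signal_line inward toward a functional block input, or None."""
    for m3, srcs3 in level3.items():
        if signal_line not in srcs3:
            continue
        for m2, srcs2 in level2.items():
            if m3 not in srcs2:
                continue
            for m1, srcs1 in level1.items():
                if m2 in srcs1:
                    return (m3, m2, m1)
    return None

# A tiny example network: one mux per level.
example = route_path(
    "ext0",
    level3={"L3a": {"ext0", "ext1"}},
    level2={"L2a": {"L3a"}},
    level1={"L1a": {"L2a"}},
)
```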
Another prior art cluster architecture is described in the textbook Guy Lemieux and David Lewis, Design of Interconnection Networks for Programmable Logic, Kluwer Academic Publishers, 2004 (henceforth “Lemieux”), page 28, FIG. 3.4. Commercial products using similar architectures can be found in a number of FPGA families offered by Altera Corporation of San Jose, Calif.
In Lemieux, Chapter 2, Section 2.1, pages 9-17, highly routable switching networks are discussed in general, including references to a number of well known switching networks such as Clos networks and Benes networks. These networks can be used in anything from telecommunications systems to integrated circuits. Routing architectures using these types of network structures may be used in programmable logic devices as an internal cluster routing network. These networks typically have at least three stages of switches and can often be optimized for decreased numbers of switches and improved routability by increasing the number of levels of switches that signals must pass through. Unfortunately, when such an approach is used in an FPGA cluster, the resulting performance degradation is undesirable.
The multi-stage switching network structure referred to as the Clos network was first proposed by Charles Clos in 1953. Clos networks are based on a grouping of interconnected crossbar switches. A crossbar switch is a device that is capable of channeling data from any of its inputs to any of its outputs, up to its maximum number of ports. In the case of a multiplexer-based crossbar switch, the number of inputs to the switch (“x”) is the same as the number of inputs to each multiplexer. The number of outputs of the switch (“y”) is equal to the number of multiplexers. An example of a multiplexer-based crossbar switch 400 is shown in FIG. 5. Each output (y) can be coupled to any of the x inputs independently. The crossbar switch has x*y crosspoints. As shown in FIG. 5, such a full crossbar can be implemented using y number of x-inputs-to-one-output (“x-to-1” or “x-1”) multiplexers (“MUXes”) 408. In the example shown in FIG. 5, x=4 and y=3.
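The multiplexer-based full crossbar of FIG. 5 can be sketched in Python (an illustrative model, not part of the disclosure): y multiplexers of x inputs each, so every output independently selects any input, with x*y crosspoints in total.

```python
# Sketch of a multiplexer-based full crossbar: y outputs, each driven by an
# x-to-1 multiplexer, so any output can be coupled to any input independently.

class Crossbar:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.select = [0] * y          # per-output mux select (configuration)

    def crosspoints(self):
        # A full crossbar has x*y crosspoints.
        return self.x * self.y

    def set_route(self, output, inp):
        assert 0 <= output < self.y and 0 <= inp < self.x
        self.select[output] = inp      # outputs are configured independently

    def evaluate(self, inputs):
        assert len(inputs) == self.x
        return [inputs[s] for s in self.select]

xbar = Crossbar(x=4, y=3)              # the x=4, y=3 example of FIG. 5
```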
An example of a general 5-parameter asymmetrical 3-stage Clos network is shown in FIG. 6. This network is defined by five parameters: m, the number of outputs from each crossbar in stage one (equals the number of crossbars in stage two and the number of inputs to each crossbar in stage three); n1, the number of inputs to each crossbar in stage one; r1, the number of crossbars in stage one (equals the number of inputs to each crossbar in stage two); n2, the number of outputs from each crossbar in stage three (equals the number of inputs to each logic cell); and r2, the number of crossbars in stage three (equals the number of outputs from each crossbar in stage two). These parameters constitute a 5-tuple (m, n1, r1, n2, r2), for the three levels of crossbars. The first level of crossbars consists of r1 (n1-to-m) full crossbars, the second level consists of m (r1-to-r2) full crossbars, and the third level consists of r2 (m-to-n2) full crossbars. The number of inputs to the Clos network shown in FIG. 6 is (r1*n1), while the number of outputs is (r2*n2). This network can be used to connect (r1*n1) inputs to r2 logic cells, each having n2 inputs.
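The 5-parameter bookkeeping above can be summarized in a short Python sketch (illustrative only; the function name is hypothetical), which records the crossbar counts and dimensions of each stage and the resulting network input and output counts.

```python
# Sketch of the (m, n1, r1, n2, r2) Clos network parameters: stage 1 has
# r1 (n1-to-m) crossbars, stage 2 has m (r1-to-r2) crossbars, and stage 3
# has r2 (m-to-n2) crossbars.

def clos_shape(m, n1, r1, n2, r2):
    return {
        "inputs": r1 * n1,      # total network inputs
        "outputs": r2 * n2,     # total network outputs (r2 cells, n2 inputs each)
        "stage1": (r1, n1, m),  # (crossbar count, inputs each, outputs each)
        "stage2": (m, r1, r2),
        "stage3": (r2, m, n2),
    }
```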
FIG. 7 shows a symmetrical Clos network. A symmetrical Clos network is one in which n1=n2=n, and r1=r2=r. A symmetrical Clos network such as shown in FIG. 7 has only three parameters (m, n, and r).
The "cost" of a Clos network (i.e., the amount of area taken up by the network, as well as the number of switches, together with the delay caused in the network by this number of switches) is typically measured by the number of crosspoints used in the network. For the asymmetrical case, the cost is r1*n1*m + m*r1*r2 + r2*m*n2 = m*(r1*n1 + r1*r2 + r2*n2). For the symmetrical case, the cost is m*(r^2 + 2*r*n). The cost is proportional to m, the number of middle level crossbars. Hence, the bigger m is, the higher the cost.
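The crosspoint-cost formulas quoted above can be written directly as a Python sketch (illustrative only), which also shows that the symmetrical formula is just the asymmetrical one with n1 = n2 = n and r1 = r2 = r.

```python
# Sketch computing the crosspoint "cost" of a Clos network.

def clos_cost_asym(m, n1, r1, n2, r2):
    # r1 stage-1 crossbars of n1*m crosspoints, m stage-2 crossbars of
    # r1*r2, and r2 stage-3 crossbars of m*n2; factoring out m gives
    # m*(r1*n1 + r1*r2 + r2*n2).
    return m * (r1 * n1 + r1 * r2 + r2 * n2)

def clos_cost_sym(m, n, r):
    # With n1 = n2 = n and r1 = r2 = r, the cost reduces to m*(r^2 + 2*r*n).
    return m * (r * r + 2 * r * n)
```

Both forms are linear in m, which is why reducing the number of middle-stage crossbars directly reduces the cost.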
The routability of a Clos network (i.e., the ability to route signals from the inputs to the outputs) also depends on m, the number of middle stage crossbars. The higher m is, the better the routability. Non-blocking networks are highly routable. There are three types of non-blocking Clos networks. The first is strictly non-blocking, in which for any connection request from an input to an output or a set of outputs, it is always possible to provide a connection path through the network without disturbing other existing connections. If more than one such path is available, any path can be selected without being concerned about realization of future potential connection requests. The second type is wide-sense non-blocking. In this type, for any connection request from an input to an output or a set of outputs, it is always possible to provide a connection path through the network without disturbing other existing connections. If more than one such path is available, the path must be selected carefully (according to some selection algorithm) to maintain the non-blocking behavior for future potential connection requests. The third type is rearrangeably non-blocking. In this type, for any connection request from an input to an output or a set of outputs, it is always possible to provide a connection path through the network by rearranging other existing connections, if necessary.
In communication networks, typically, the cost of a strictly non-blocking network architecture is too high to make implementation practical. Wide sense non-blocking is more practical and can be built more efficiently, and is therefore a more common implementation of Clos networks in the communications context.
There are two types of routing requests that may be made to route a signal in a Clos network. The first type is unicast, in which each input can be connected to at most one output in a one-to-one fashion. The second type is multicast, in which each input can be connected to multiple outputs. A network that is non-blocking for multicast routing requires a bigger m than a unicast non-blocking network, and hence has a higher cost.
Known bounds on m with respect to the routability include the following. For wide-sense multicast non-blocking there are two cases: a symmetrical network and an asymmetrical network. For the symmetrical case, m > (n−1)*(x + r^(1/x)), where 1 <= x <= min(n−1, r); optimizing x results in m > 2*(n−1)*(log r/log log r) + (n−1)*(log r)^(1/2). For the asymmetrical case, m > (n1−1)*x + (n2−1)*r2^(1/x), where 1 <= x <= min(n2−1, r2); optimizing x results in m > 2*(n1−1)*(log r2/log log r2) + (n2−1)*(log r2)^(1/2). There is no known bound for rearrangeably multicast non-blocking. For strictly unicast non-blocking, m >= n1 + n2 − 1 for the asymmetrical case, and m >= 2*n − 1 for the symmetrical case. For rearrangeably unicast non-blocking, m >= max(n1, n2) for the asymmetrical case, and m >= n for the symmetrical case.
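The closed-form unicast bounds quoted above can be captured in a short Python sketch (illustrative only; the multicast wide-sense bounds involve an optimization over x and are omitted here).

```python
# Sketch of the minimum middle-stage crossbar counts for unicast routing.

def m_strict_unicast(n1, n2):
    # Strictly non-blocking for unicast: m >= n1 + n2 - 1
    # (symmetrical case n1 = n2 = n gives m >= 2*n - 1).
    return n1 + n2 - 1

def m_rearrangeable_unicast(n1, n2):
    # Rearrangeably non-blocking for unicast: m >= max(n1, n2)
    # (symmetrical case gives m >= n).
    return max(n1, n2)
```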
For unicast non-blocking networks, it has been shown that in many cases the network will function as mostly non-blocking for multicast as well (i.e., the probability that a multicast routing request will be blocked is fairly low). See Yang and Wang, On Blocking Probability of Multicast Networks, IEEE Transactions on Communications, Vol. 46, No. 7, July 1998.
Most multistage network research has focused on the creation of non-blocking networks. From the perspective of programmable logic devices such as FPGAs, the routing problem is rearrangeably multicast. It is multicast because it is common for the output of a logic cell to go to multiple locations. Also, it is rearrangeable because only the final routing solution needs to satisfy the routing requirements, while the intermediate steps are irrelevant because when routing one connection, it is acceptable to rearrange (rip up and reroute) existing connections until an acceptable solution is determined. This is performed by sophisticated routing software typically provided by FPGA vendors to end users.
However, using a rearrangeable multicast non-blocking network to implement an FPGA interconnect is impractical due to its high cost (even though the bound is unknown, it will be at least as large as the bound for unicast non-blocking, as described above). It has more flexibility than actually needed in a real-world FPGA interconnect. It also fails to exploit the locality that real-world FPGA designs exhibit, which could otherwise be used to save area.
U.S. Pat. No. 6,294,928 to Lytle et al. (“Lytle”) discloses a Clos network-based programmable logic architecture employing crossbar switches. U.S. Pat. No. 6,975,139 to Pani, et al., (“Pani”) also discloses a Clos network-based FPGA routing architecture. Pani discloses an organization of two columns of crossbars followed by a column of LUTs.
Most applications of a Clos network require that m≧n1, which makes the network non-blocking for any set of unicast connections, though not necessarily for multicast connections. An example of a unicast non-blocking network having an m≧n1 constraint is shown in FIG. 7. Prior art Clos network-based programmable logic architectures such as Lytle and Pani all include the constraint that m≧n1. This restriction makes sense in the context of hierarchical architectures such as Pani. In hierarchical routing, a smaller number of signals from higher-level routing is expanded to drive a larger number of lower-level signals. The network disclosed in Pani is described as non-blocking, and therefore the m≧n1 constraint is a critical feature of that network.