1. Field of the Invention
The present invention relates to field programmable gate array (FPGA) architectures. More specifically, the invention relates to an area efficient interconnect scheme for a cluster based FPGA architecture that connects inter-cluster routing tracks to the inputs of look-up tables (or other logic cells) in the cluster.
2. The Prior Art
A cluster architecture is a type of FPGA architecture in which the basic repeating layout tile is a cluster. The cluster is an aggregation of logic blocks and routing multiplexers. Usually, a limited number of inputs are provided into the cluster in order to save area. A routing multiplexer is a basic FPGA routing element with multiple inputs and one output. It can be programmed to connect one of its inputs to the output. The number of inputs to the routing multiplexer is called the multiplexer size. An N×M crossbar connects N different inputs to M outputs. It is equivalent to M multiplexers, with each multiplexer selecting its output from a subset of the N inputs. If the N inputs are drawn as N horizontal wires and the M outputs as M vertical wires, there are N*M crosspoints, each representing a possible input-output connection. The number of connections (or switches) in a crossbar is the number of crosspoints at which a connection is actually provided. A fully populated crossbar has N*M connections. A p % sparsely populated crossbar has N*M*(p/100) connections.
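As an illustration of the connection counts just described, the following minimal Python sketch computes the number of switches in an N×M crossbar at a given population percentage. The function name and the rounding to a whole number of switches are assumptions made for this example only.

```python
def crossbar_switch_count(n_inputs: int, m_outputs: int,
                          population_pct: float = 100.0) -> int:
    """Return the number of programmable connections in an N x M crossbar.

    population_pct is the percentage p of the N*M crosspoints that are
    populated (100 means fully populated). Rounding to the nearest whole
    switch is an assumption of this sketch.
    """
    if not 0.0 <= population_pct <= 100.0:
        raise ValueError("population_pct must be between 0 and 100")
    crosspoints = n_inputs * m_outputs
    return round(crosspoints * population_pct / 100.0)


# Example: a 16-input, 10-output crossbar, fully and 50% populated.
full = crossbar_switch_count(16, 10)
half = crossbar_switch_count(16, 10, 50.0)
```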
A cluster input interconnect scheme is an interconnect network that connects inter-cluster routing tracks to inputs of lookup tables (LUTs) (or other logic cells). It usually consists of multiplexers. Depending on the number of multiplexers that a routing track signal needs to pass through to reach the LUT inputs, a scheme can be classified as a one-level scheme or a two-level scheme. Depending on the number of unique signals that may be routed to the LUT inputs simultaneously, a scheme can be classified as “having input bandwidth limitation” or “not having input bandwidth limitation.” Usually, one-level schemes do not have input bandwidth limitation, while two-level schemes exhibit input bandwidth limitation.
A one-level input interconnect scheme is a scheme that connects the routing tracks directly to the logic cells or LUT input multiplexers and usually has no bandwidth limitation. This scheme has been used, for example, in FPGAs available from Xilinx of San Jose, Calif. An illustrative example of such a scheme is shown in FIG. 1. This scheme takes signals from a plurality of T input tracks 10-1 through 10-T. A plurality of M input signals on lines 12-1 through 12-M are programmably connected to the inputs of multiplexers 14-1 through 14-P through an interconnect matrix 16 including programmable interconnect elements. There are numerous kinds of programmable interconnect elements as is known in the art.
The outputs of multiplexers 14-1 through 14-P each feed an input of one of N LUTs identified by reference numerals 18-1 through 18-N. Each of LUTs 18-1 through 18-N has multiple inputs. Let S be the number of inputs of the LUT, or LUT size (for example, S=4 for a four-input LUT). Therefore, the number of input multiplexers P=S*N (the total number of LUT inputs for N LUTs). The number of input signals M<=P*MUX size, since each input signal is allowed to fan out to more than one input MUX. Finally, the number of routing tracks T>=M.
Architectures of the type shown in FIG. 1 are usually not bandwidth limited in that the total number of input signals that are provided is at least equal to or (more often) considerably larger than the total number of LUT input multiplexers; i.e., M>=P.
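The parameter relationships stated for the one-level scheme of FIG. 1 (P=S*N, P<=M<=P*MUX size, and T>=M) can be checked with a short Python sketch such as the following. The function name, argument names, and the sample values are hypothetical and are chosen only for illustration; the condition M>=P is described above as usual rather than mandatory.

```python
def one_level_parameters_ok(tracks_t: int, signals_m: int,
                            lut_size_s: int, n_luts: int,
                            mux_size: int) -> bool:
    """Check the one-level relationships P=S*N, P<=M<=P*MUX size, T>=M."""
    p = lut_size_s * n_luts                  # number of LUT input multiplexers
    not_bandwidth_limited = signals_m >= p   # M >= P (the usual case)
    fits_mux_inputs = signals_m <= p * mux_size  # M <= P * MUX size
    enough_tracks = tracks_t >= signals_m        # T >= M
    return not_bandwidth_limited and fits_mux_inputs and enough_tracks


# Hypothetical cluster: four 4-input LUTs, 8-input multiplexers,
# 32 input signals drawn from 64 routing tracks.
ok = one_level_parameters_ok(tracks_t=64, signals_m=32,
                             lut_size_s=4, n_luts=4, mux_size=8)
```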
A two-level input interconnect scheme is a scheme that connects the routing tracks first to inputs of first-level multiplexers. The outputs of the first-level multiplexers are connected to inputs of LUT input multiplexers (or second-level multiplexers). The two-level input interconnect scheme includes first and second stage crossbars.
An example of a two-level input interconnect scheme is shown in FIG. 2. As in the scheme shown in FIG. 1, the two-level interconnect scheme shown in FIG. 2 takes signals from a plurality of T input tracks 10-1 through 10-T. A plurality of M input signals are connected to the inputs of first-level multiplexers 14-1 through 14-10 using an interconnect matrix crossbar 16. Multiplexers 14-1 through 14-10 are shown each having sixteen inputs.
The outputs of the first-level multiplexers 14-1 through 14-10 are connected to the inputs of P (P=16) second-level multiplexers 18-1 through 18-16 using an interconnect matrix crossbar 20. The outputs of multiplexers 18-1 through 18-16 each feed an input of one of N LUTs (N=4) identified by reference numerals 24-1 through 24-4. Each of LUTs 24-1 through 24-4 has S inputs. As in FIG. 1, the number of second-level multiplexers P=S*N. The number of first-level multiplexers 14-1 through 14-10 is K=(S*(N+1)/2), which gives K=10 for S=4 and N=4. Also, as in FIG. 1, the number of first-level multiplexer input signals M<=K*MUX size, since each input signal is allowed to fan out to multiple first-level MUXes. Finally, the number of routing tracks T>=M.
Prior-art two-level schemes have bandwidth limitations. The bandwidth limitation comes from the fact that the number of first-level MUXes K (=(S*(N+1)/2)) is smaller than the number of LUT input MUXes P (=S*N), which means that N LUTs (i.e., S*N LUT inputs) have to share at most K unique input signals. The bandwidth limitation is necessary to make the scheme area efficient. There are many publications discussing how large the bandwidth limitation should be. For four-input LUTs, a type of logic block commonly used in FPGAs, the limitation on the number of unique signals going into a cluster simultaneously is generally accepted to be 4*(N+1)/2=2N+2, where N is the number of four-input LUTs in a cluster.
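The relationship between K and P described above can be worked through with a brief Python sketch. The function name is hypothetical, and the use of integer (floor) division for odd values of S*(N+1) is an assumption of this example, since the text only gives the formula for cases where it divides evenly.

```python
def two_level_sizes(lut_size_s: int, n_luts: int) -> tuple[int, int]:
    """Return (K, P) for a prior-art two-level input interconnect.

    K = S*(N+1)/2 is the number of first-level MUXes (the bandwidth limit).
    P = S*N is the number of LUT input (second-level) MUXes.
    Floor division is an assumption for cases where S*(N+1) is odd.
    """
    k = lut_size_s * (n_luts + 1) // 2
    p = lut_size_s * n_luts
    return k, p


# The FIG. 2 example: four 4-input LUTs.
k, p = two_level_sizes(4, 4)
bandwidth_limited = k < p  # N LUTs must share at most K unique signals
```

For four-input LUTs the value of K returned here equals 2N+2 for every cluster size N.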
An input bandwidth limitation is the number of unique routing track signals that can be simultaneously routed to the LUT inputs through a cluster input interconnect. A cluster of N LUTs each having S inputs could need S*N unique signals in the worst case. If the number of unique input signals (out of M available to the cluster) that can be simultaneously routed to the LUT inputs is smaller than S*N, then it is said that the cluster (or the cluster input interconnect) has input bandwidth limitation. Otherwise, the cluster (or the cluster input interconnect) has no bandwidth limitation.
The bandwidth limit imposes a hard constraint in clustering, i.e., if the number of unique external signals required by the cells in the cluster exceeds the bandwidth limit, the cluster is not routable. A two-level scheme with bandwidth limitation has been used in the VPR-type architecture, an FPGA architecture popular in academia that is based on LUT clusters. The cluster input scheme in a VPR-type architecture is a two-level scheme with bandwidth limitation S*(N+1)/2. The first interconnect crossbar is usually sparsely populated, and the second interconnect crossbar is usually assumed to be fully populated, which is very area expensive.
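The hard clustering constraint described above can be sketched in Python as follows. The data representation (each LUT given as a collection of net names), the function names, and the treatment of signals generated inside the cluster as not counting against the limit are all assumptions made for this illustration.

```python
def unique_external_signals(lut_input_nets, internal_nets=()):
    """Return the set of distinct nets the cluster must receive from outside.

    lut_input_nets: one collection of net names per LUT in the cluster.
    internal_nets: nets produced inside the cluster (assumed, for this
    sketch, not to consume input bandwidth).
    """
    needed = set().union(*lut_input_nets)
    return needed - set(internal_nets)


def cluster_fits_bandwidth(lut_input_nets, lut_size_s, n_luts,
                           internal_nets=()):
    """True if the cluster's external signals fit the limit S*(N+1)/2."""
    limit = lut_size_s * (n_luts + 1) // 2
    external = unique_external_signals(lut_input_nets, internal_nets)
    return len(external) <= limit


# Four 4-input LUTs that all need different nets: 16 unique signals.
worst_case = [[f"n{4 * i + j}" for j in range(4)] for i in range(4)]
# Four 4-input LUTs that share the same four nets: 4 unique signals.
shared = [["a", "b", "c", "d"] for _ in range(4)]
```

With the limit of 10 for S=4 and N=4, the `worst_case` cluster is rejected and the `shared` cluster is accepted.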
Such a scheme has also been used in FPGAs available from Altera Corp. of San Jose, Calif. Commercial products like the Stratix line of products from Altera use 50% connection population in the second crossbar.
Researchers have studied the depopulation of two-level interconnect schemes by looking into each stage separately. The research has concluded that having K>=S*N first-level MUXes in such a scheme (i.e., no bandwidth limitation, or allowing every LUT input to have a unique input signal) is excessive and therefore a waste of resources. On the other hand, at least one article has indicated that an M=K*MUX size depopulation scheme provides poor routability (see Guy Lemieux and David Lewis. Design of Interconnection Networks for Programmable Logic. Kluwer Academic Publishers, 2004 (“Lemieux and Lewis”)).
In the prior art, the Monte Carlo method is used for measuring routability. This method picks a large number of random routing vectors, and measures the percentage of them that can be routed on a routing structure. The obtained percentage measures the routability of the routing structure, and can be used to guide iterative improvement of the connectivity in the routing structure. This method can only be used for a one-level crossbar.
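A minimal Python sketch of such a Monte Carlo measurement for a one-level crossbar is given below. It rests on several assumptions that are not stated in the text: a routing vector is modeled as a random set of distinct input signals, the vector is considered routable if every signal can be connected to its own distinct output (checked with a standard augmenting-path bipartite matching), and the crossbar is represented as a dictionary from each input to the set of outputs it can drive. All names are hypothetical.

```python
import random


def can_route(signals, connects):
    """True if every signal in the vector can be given a distinct output."""
    owner = {}  # output -> signal currently assigned to it

    def try_assign(sig, seen):
        for out in connects[sig]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or displace its owner if the owner
            # can be moved to another output.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = sig
                return True
        return False

    return all(try_assign(sig, set()) for sig in signals)


def monte_carlo_routability(connects, vector_size, trials=1000, seed=0):
    """Fraction of random routing vectors that can be routed."""
    rng = random.Random(seed)
    inputs = list(connects)
    routed = sum(
        can_route(rng.sample(inputs, vector_size), connects)
        for _ in range(trials)
    )
    return routed / trials


# A fully populated 8x4 crossbar, and one where every input reaches
# only output 0.
full = {i: set(range(4)) for i in range(8)}
starved = {i: {0} for i in range(8)}
```

In this sketch the fully populated crossbar routes every four-signal vector, while the `starved` crossbar cannot route any vector of two or more signals. Intermediate percentages for sparse crossbars could guide iterative changes to the connection pattern.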