Over time, processor speeds have increased faster than the rate at which information can enter and exit a chip. In many cases, it was found that increasing processor speed while ignoring the effects of input/output (I/O) produced little improvement—essentially, if information cannot get into or out of the chip at a fast enough rate, then increasing CPU speed diminishes in importance.
Data transfer to and from a chip can be improved by increasing the bit rate and/or the number of I/O pins. Since pins cannot be miniaturized to the same extent as transistors (pins must be physically strong enough to withstand contact), the rate at which the number of transistors on a chip has increased far outpaces the rate at which the number of pins on a chip has increased. For example, in Intel microprocessors, the number of transistors has increased by a factor of 20,000 in the last 30 years, whereas the number of pins in these chips increased merely by a factor of 30. Hence, the rate at which a chip can generate and process information is much larger than the available conduit to convey this information. The restriction imposed by the unavailability of a sufficient number of pins in a chip is called “pin limitation.”
An example of the magnitude of the problem is presented by reconfigurable architectures, in particular, integrated circuit chips such as Field Programmable Gate Arrays (FPGAs). An FPGA is an array of programmable logic elements, all of which must be configured to suit the application at hand. A typical FPGA structure consists of a two-dimensional array of configurable logic elements connected by a configurable interconnection network, such as shown in FIG. 1. FIG. 1 shows a networked structure, where the configurable logic blocks (CLBs) are the configurable functional elements, and the switches “S” are the configurable elements in the interconnection network. Each CLB in an FPGA is sometimes subdivided into smaller configurable logic elements. For example, the Xilinx Virtex-5 FPGA's CLBs each contain two elements known as slices. At the deepest level, the most basic functional element in an FPGA usually consists of some combination of one or more Look-Up Tables (LUTs), combinational logic gates, flip-flops, and other basic logic elements. In the Virtex-5 FPGA, each slice contains four 64×1 LUTs, four flip-flops, an arithmetic and carry chain, and several multiplexers used to combine the outputs of the LUTs. Often the CLBs in an FPGA are also interspersed with other functional units, such as small memory blocks, other adder chains, and multipliers. Thus, a CLB can contain many configurable switches. Notwithstanding variations in FPGA terminology, we will use the term “CLB” to denote the basic unit represented in FIG. 2.
The FPGA's interconnection network is typically a two-dimensional mesh of configurable switches. As in a CLB, each switch S represents a large bank of configurable elements. The state of all switches and elements within all CLBs is referred to as a “configuration” of the FPGA. Because there is a large number of configurable elements in an FPGA (LUTs, flip-flops, switches, etc.), a single configuration requires a large amount of information. For example, the Xilinx Virtex-5 FPGA with a 240×108 array of CLBs requires on the order of 79 million bits for a single full configuration. The FPGA's CLBs are fine-grained functional elements that are incapable of executing instructions or generating configuration bits internally. Thus, configuration information must come from outside the chip. A limited amount of configuration information can be stored in the chip as “contexts”; however, given the limited amount of memory available on an FPGA for such a purpose, an application may require more contexts than can be stored on the FPGA. Hence, in most cases, configuration information must still come from outside the chip, and the pin-limited input can have severe consequences for the time needed for reconfiguration.
A number of applications benefit from a technique called dynamic reconfiguration, in which elements of the FPGA chip are reconfigured to alter their interconnections and functionality while the application is executing on the FPGA. Dynamic reconfiguration has two main benefits. First, a dynamically reconfigurable architecture can reconfigure between various stages of an application to use its resources efficiently at each stage. That is, it reuses hardware resources more efficiently across different parts of an algorithm. For example, an algorithm using two multipliers in Stage 1 and eight adders in Stage 2 can run on dynamically reconfigurable hardware that configures as two multipliers for Stage 1 and as eight adders for Stage 2. Consequently, this algorithm will run on hardware that has two multipliers or eight adders, as opposed to a non-configurable architecture that would need two multipliers and eight adders.
The second benefit of dynamic reconfiguration is a fine tuning of the architecture to exploit characteristics of a given instance of the problem. For example in matching a sequence to a given pattern, the internal “comparator” structure can be fine-tuned to the pattern. Further, this tuning to a problem instance can also produce faster solutions.
Dynamic reconfiguration requires a fast reconfiguration scheme. Because of this, partial reconfiguration is normally performed where only a portion of the FPGA is reconfigured. Partial configuration involves selecting the portion of the FPGA requiring reconfiguration (the addresses) and inputting the necessary configuration bits. Due to pin limitation, only a very coarse selection of addresses is available in a given time increment, resulting in a still substantially large number of FPGA elements being selected for reconfiguration. This implies that elements that do not need to be reconfigured must be “configured” anyway along with those that actually require reconfiguration.
In partial reconfiguration, the information entering the chip can be classified into two categories: (a) selection and (b) configuration. The selection information contains the addresses of the elements that require reconfiguration, while the configuration information contains the necessary bits to set the state of the targeted elements.
In order to facilitate partial reconfiguration, FPGAs are typically divided into sets of frames, where a frame is the smallest addressable unit for reconfiguration. In current FPGAs, a frame is typically one or more columns of CLBs. Currently, partial reconfiguration can only address and configure a single frame at a time, as a 1-hot decoder is usually employed. If we assume that each CLB receives the same number of configuration bits, say α, and the number of CLBs in each frame is the same, say C, then the number of configuration bits needed for each frame is Cα. If the number of bits needed for selecting a single frame is b, then the total number of bits B needed to reconfigure a frame is:
B = b + Cα
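As a rough illustration of the per-frame cost B = b + Cα, the following Python sketch evaluates the expression for sample values. The function names and the sample figures (64 frames, 100 CLBs per frame, 1,000 bits per CLB) are hypothetical and are not taken from any particular FPGA. The sketch further assumes that a frame address is binary-encoded, so that b is the ceiling of log2 of the number of frames.

```python
def frame_select_bits(num_frames):
    """Selection bits b for one frame, assuming a binary-encoded frame address."""
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    # (num_frames - 1).bit_length() equals ceil(log2(num_frames)) for num_frames >= 2.
    return max(1, (num_frames - 1).bit_length())


def bits_per_frame(b, clbs_per_frame, alpha):
    """Total bits B = b + C*alpha needed to reconfigure a single frame."""
    return b + clbs_per_frame * alpha


# Hypothetical device: 64 frames, C = 100 CLBs per frame, alpha = 1000 bits per CLB.
b = frame_select_bits(64)
B = bits_per_frame(b, 100, 1000)
print(b, B)
```

With these sample values the configuration term Cα dominates the selection term b by several orders of magnitude, which is why reconfiguring unneeded CLBs within a frame is costly.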
Since the granularity of reconfiguration is at the frame level, every CLB in a frame is reconfigured, regardless of whether or not the application requires it to be reconfigured. This can result in a “poorly-focused” selection of elements for reconfiguration, as more elements than necessary are reconfigured in each iteration. This implies that a large number of bits and a large time overhead are spent on the reconfiguration of each individual frame. If the granularity of selection is made finer, i.e., if fewer CLBs are in each frame, then the number of selection bits needed to address the frames increases by a small amount while the number of configuration bits for each frame decreases. However, since a 1-hot decoder can select only one frame per iteration, this also increases (on average) the total number of iterations necessary to reconfigure the same amount of area in the FPGA. Pin limitation thus creates a severe restriction on the extent to which an FPGA can be dynamically reconfigured.
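The trade-off between frame granularity and iteration count can be sketched as follows. This is a simplified model under stated assumptions, not a description of any actual device: CLBs are indexed linearly, a frame is a run of consecutive CLBs, the frame address is binary-encoded, and one frame is selected per iteration. The function name and all sample values are hypothetical.

```python
def reconfig_cost(targets, total_clbs, clbs_per_frame, alpha):
    """Return (iterations, total_bits) for reconfiguring the CLBs in `targets`.

    Every frame that contains at least one target CLB is rewritten in full,
    at a cost of b + C*alpha bits, one frame per iteration.
    """
    num_frames = total_clbs // clbs_per_frame
    b = max(1, (num_frames - 1).bit_length())          # selection bits per frame
    touched = {t // clbs_per_frame for t in targets}   # frames holding a target
    iterations = len(touched)
    total_bits = iterations * (b + clbs_per_frame * alpha)
    return iterations, total_bits


# Hypothetical 64-CLB device, alpha = 10 bits per CLB, four CLBs to change.
targets = [0, 9, 18, 27]
coarse = reconfig_cost(targets, 64, 16, 10)  # 4 frames of 16 CLBs
fine = reconfig_cost(targets, 64, 4, 10)     # 16 frames of 4 CLBs
print(coarse, fine)
```

In this example the finer frames halve the total number of bits but double the number of iterations, which mirrors the tension described above.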
1.1 Notation
Before we proceed further, we introduce some notation.
In general, we use the term “word” to mean a set of bits. Different words may have different numbers of bits. We also use the terms “string” and “signal” synonymously with “word.”
The O(·) notation indicates an upper bound on the “order of” a function and is used to describe how the size of the input data affects resources (time, cost, etc.) in an algorithm or hardware. Specifically, for two functions ƒ(n) and g(n) of a variable n, we say that ƒ(n)=O(g(n)) if and only if there is a positive constant c>0 and an integer constant n0, such that for all n≥n0, we have ƒ(n)≤cg(n). The relationship ƒ(n)=O(g(n)) signifies that the “order of” (or asymptotic complexity of) ƒ(n) is at most that of g(n), or that ƒ(n) increases at most as fast as g(n). Whereas O(·) denotes an upper bound on the complexity, Ω(·) and θ(·) indicate a lower bound on the complexity and the exact complexity, respectively. Specifically, ƒ(n)=Ω(g(n)) if and only if g(n)=O(ƒ(n)). We say ƒ(n)=θ(g(n)) if and only if ƒ(n)=O(g(n)) and ƒ(n)=Ω(g(n)).
Parts of the invention will be described in terms of “ordered partitions.” A partition of a set A is a division of the elements of the set into disjoint non-empty subsets (or blocks). A partition π with k blocks is called a k-partition. For example, a 3-partition of the set {7,6,5,4,3,2,1,0} is {{7,6,5,4},{3,2},{1,0}}. Partitions have no imposed order. An ordered k-partition is a k-partition {S0, S1, . . . , Sk−1} with an order (from 0 to k−1) imposed on the blocks. An ordered partition will be denoted by an ordered list of blocks. For instance, a 2-partition {S0,S1} may be ordered as (S0,S1) or as (S1,S0), and (S0,S1)≠(S1,S0).
A useful operation on partitions is the product of two partitions. Let π1 and π2 be two (unordered) partitions (not necessarily of the same size). Let π1={S0, S1, . . . , Sk} and π2={P0, P1, . . . , Pl}; then their product π1π2 is a partition {Q0, Q1, . . . , Qm} such that for any block Qh∈π1π2, elements a,b∈Qh if and only if there are blocks Si∈π1 and Pj∈π2 such that a,b∈Si∩Pj. That is, two elements are in the same block of π1π2 if and only if they are in one block of π1 and in one block of π2. For instance, consider the partitions π1={{7,6,5,4},{3,2},{1,0}} and π2={{7,6},{5,4,3,2},{1,0}}. Then π1π2={{7,6},{5,4},{3,2},{1,0}}=π2π1.
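The product operation can be sketched in a few lines of Python. This is a minimal illustration only; the function name partition_product and the representation of a partition as a collection of sets are choices made for this example. Each block of the product is a non-empty intersection of one block from each partition.

```python
def partition_product(p1, p2):
    """Product of two partitions of the same set, each given as a list of sets.

    Two elements share a block of the result exactly when they share a block
    in p1 and also share a block in p2.
    """
    blocks = set()
    for s in p1:
        for p in p2:
            common = frozenset(s) & frozenset(p)
            if common:                 # keep only non-empty intersections
                blocks.add(common)
    return blocks


pi1 = [{7, 6, 5, 4}, {3, 2}, {1, 0}]
pi2 = [{7, 6}, {5, 4, 3, 2}, {1, 0}]
product = partition_product(pi1, pi2)
print(sorted(sorted(block) for block in product))
```

For the two example partitions given in the text, the result has the four blocks {7,6}, {5,4}, {3,2} and {1,0}, and the same result is obtained with the arguments swapped.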
For any digital circuit, including those considered in this invention, an n-bit output can be viewed as a subset of an n-element set. Let Zn={0, 1, . . . , n−1}. Consider an n-bit signal A=A(n−1)A(n−2) . . . A(0) (where A(i) is the ith bit of A; in general, we will consider bit 0 to be the least significant bit, or lsb). If A is an n-bit output signal (or word) of a digital circuit, then it can be viewed as the subset {i∈Zn:A(i)=1} of Zn. The n-bit string A is called the characteristic string of the above subset. The set {i∈Zn:A(i)=1} is said to be characterized by A and is sometimes referred to as the characteristic set. For example, if n=8, then output A=00001101 corresponds to the subset {0,2,3}. Outputs 00000000 and 11111111 correspond to the empty set, Ø, and Zn, respectively. (It should be noted that the convention could be changed to exchange the meanings of 0's and 1's. That is, a 0 (resp., 1) in the characteristic string would represent the inclusion (resp., exclusion) of an element of Zn in the set. All ideas presented in this document apply also to this “active-low” convention.) Throughout this document, we assume (unless mentioned otherwise) that the base of all logarithms is 2. Consequently, we will write log n to indicate log2 n. We will also use the notation log^a n to denote (log n)^a.
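The correspondence between an n-bit output and a subset of Zn can be illustrated with the following Python sketch. The function names are hypothetical; the sketch assumes the active-high convention and that the string is written with bit n−1 first and bit 0 (the lsb) last.

```python
def subset_from_string(a):
    """Characteristic set of the bit string `a` (leftmost character is bit n-1)."""
    n = len(a)
    return {n - 1 - pos for pos, bit in enumerate(a) if bit == "1"}


def string_from_subset(subset, n):
    """Characteristic string of `subset`, a subset of {0, ..., n-1}."""
    return "".join("1" if i in subset else "0" for i in range(n - 1, -1, -1))


print(subset_from_string("00001101"))
print(string_from_subset({0, 2, 3}, 8))
```

Applied to the example in the text, the string 00001101 maps to the subset {0,2,3}, and the reverse conversion recovers the same string.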
1.2 Prior Art
Prior art methods to address the pin limitation problem include: (1) multiplexing, (2) storing information within the design, and (3) decoding. Multiplexing refers to combining a large number of channels into a single channel. This can be accomplished in a variety of ways depending on the technology. Each method assumes the availability of a very high speed, high bandwidth channel on which the multiplexing is performed. For example, in the optical domain, wavelength division multiplexing allows multiple signals of different wavelengths to travel simultaneously in a single waveguide. Time division multiplexing requires the multiplexed signal to be much faster than the signals multiplexed. Used blindly, this is largely useless in the FPGA setting, as it amounts to setting an unreasonably high clocking rate for parts of the FPGA.
Storing information within the design attempts to alleviate the pin limitation problem by generating most information needed for execution of an application inside the chip itself (as opposed to importing it from outside the chip). This requires a more “intelligent” chip. In an FPGA setting it boils down to an array of coarse-grained processing elements rather than simple functional blocks (CLBs). One example is the use of virtual wires, in which each physical wire corresponding to an I/O pin is multiplexed among multiple logical wires. The logical wires are then pipelined at the maximum clocking frequency of the FPGA, in order to utilize the I/O pin as often as possible. Another example of such a solution is the Self-Reconfigurable Gate Array. This latter approach is a significant departure from current FPGA architectures. Yet another approach is to compress the configuration information, thereby reducing the number of bits sent into the chip.
Decoders are the third means used to address the pin limitation problem. A decoder is typically a combinational circuit that takes as input a relatively small number of bits, say x bits, and outputs a larger number of bits, say n bits, according to some mapping; such a decoder is called an “x-to-n decoder.” If the x inputs are pins to the chip and the n outputs are expanded within the chip, a decoder provides the means to deliver a large number of bits to the interior of the chip. An x-to-n decoder (that has x input bits) can clearly produce no more than 2^x output sequences, and some prior knowledge must be incorporated in the decoder to produce a useful expansion to n output bits. Decoders have also been used before with FPGAs. Our invention, when used in the context of FPGAs, has more application in selecting parts of the chip in a more focused way than conventional decoders do. However, in a broader context, the method we propose is a general decoder for any scheme employing fixed-size code words that decode into (larger) fixed-size target words.
As we noted earlier, for any digital circuit, including a decoder, an n-bit output can be viewed as a subset of the n-element set Zn={0, 1, . . . , n−1}. Thus, the set of outputs produced by an x-to-n decoder can be represented as a set of (at most 2^x) subsets of Zn.
An illustration of 3-to-8 decoders (with 3 input bits and 8 output bits) is shown in Table 1.
Sets S0, S1, S2 and S3 represent different decoders, each producing subsets of Zn. For instance, S0 corresponds to the set of subsets {{0},{1},{2}, . . . ,{7}}. This represents the 3-to-8 one-hot decoder.
TABLE 1
Example of 3-to-8 Decoders

Decoder Inputs   S0         S1         S2         S3
000              00000001   01010101   11111111   00001101
001              00000010   10101010   00001111   10010010
010              00000100   00110011   00000011   10100010
011              00001000   11001100   00000001   00111101
100              00010000   00001111   11110000   01001110
101              00100000   11110000   11000000   11010001
110              01000000   11111111   10000000   11100001
111              10000000   00000000   00111100   01111110

Current decoders in FPGAs are fixed decoders, producing a fixed set of subsets (output bit combinations) over all possible inputs. The fixed decoder that is normally employed in most applications is the one-hot decoder that accepts a (log2 n)-bit input and generates a 1-element subset of Zn (see set S0 in Table 1). (In subsequent discussion all logarithms will be assumed to be to base 2, that is, log n=log2 n). In fact, the term “decoder” is usually taken to mean the one-hot decoder.
A one-hot decoder causes severe problems if, in an array of n elements, some arbitrary pattern of those elements is needed for reconfiguration. Here, selecting an appropriate subset can take up to θ(n) iterations. Notwithstanding this inflexibility, one-hot decoders are simple combinational circuits with a low O(n log n) gate cost (typically given as the number of gates) and a low O(log n) propagation delay. The one-hot decoder will usually take multiple cycles or iterations to set all desired elements to the desired configuration. Thus, reconfiguration is a time-consuming task in current FPGAs and, consequently, they fail to fully exploit the power of dynamic reconfiguration demonstrated on theoretical models.
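The behavior of a one-hot decoder, and the reason an arbitrary pattern can require many iterations, can be sketched as follows. This is a behavioral model written for illustration and says nothing about gate-level structure; the function names are hypothetical, and the output string is written with bit 0 (the lsb) rightmost.

```python
def one_hot_decode(address, n):
    """n-bit output string in which only bit `address` is 1 (bit 0 is rightmost)."""
    if not 0 <= address < n:
        raise ValueError("address out of range")
    return format(1 << address, "0{}b".format(n))


def one_hot_iterations(target_subset):
    """Iterations needed to select `target_subset` when each iteration picks one element."""
    return len(set(target_subset))


print(one_hot_decode(0, 8), one_hot_decode(7, 8))
print(one_hot_iterations({0, 2, 3}))     # three elements, three iterations
print(one_hot_iterations(range(8)))      # worst case: all n elements
```

Because each decoder output selects exactly one element, the iteration count equals the size of the target subset, which in the worst case is n.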
Look-up tables (LUTs) can function as a “configurable decoder.” A 2^x×n LUT is simply a (2^x)-entry table, where each entry has n bits. It can produce 2^x independently chosen n-bit patterns that can be selected by an x-bit address. LUTs are highly flexible, as the n-bit patterns chosen for the LUT need no relationship to each other. Unfortunately, this “LUT decoder” is also costly; the gate cost of such a LUT is O(n·2^x). For a gate cost of O(n log n), a LUT decoder can only produce O(log n) subsets or mappings. To produce the same number of subsets as a one-hot decoder, the LUT decoder has O(n^2) gate cost. Clearly, this does not scale well.
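The cost comparison above can be made concrete with a short Python sketch. The figures are order-of-magnitude estimates only: all constant factors hidden by the O(·) notation are taken to be 1, which is an assumption made for illustration, and the function names are hypothetical.

```python
def one_hot_cost(n):
    """Approximate one-hot decoder gate cost, n * log2(n), with constants ignored."""
    return n * (n - 1).bit_length()      # bit_length gives ceil(log2 n) for n >= 2


def lut_cost(n, num_patterns):
    """Approximate gate cost of a LUT storing num_patterns n-bit patterns."""
    return n * num_patterns


def lut_patterns_within(n, budget):
    """Number of n-bit patterns a LUT can hold within a given gate budget."""
    return budget // n


n = 1024
budget = one_hot_cost(n)
print(budget)                            # one-hot decoder cost for n outputs
print(lut_patterns_within(n, budget))    # patterns a LUT affords at that cost
print(lut_cost(n, n))                    # LUT cost to match the n one-hot subsets
```

For n = 1024, a LUT built within the one-hot decoder's budget holds only about log n = 10 patterns, while matching the one-hot decoder's n subsets costs on the order of n^2 gates.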
What is needed is a configurable decoder that is intermediate between the high-flexibility, high-cost LUT decoder and the low-flexibility, low-cost fixed decoder.