Multiplier circuits are common in many types of systems, such as DSP (digital signal processing) systems. Therefore, several different types of multiplier circuits have been devised. One such type is the array multiplier circuit, in which a matrix of partial products is derived in parallel, and then a 2-dimensional array of full adders is used to sum the rows of partial products. The matrix of partial products is naturally trapezoidal in shape. However, the trapezoid can be skewed into a rectangle with the sum or carry bits being propagated diagonally. The rectangular array multiplier is regular in structure, and each cell in the rectangle is coupled only to the neighboring cells. Therefore, this architecture is suitable for implementation in an integrated circuit (IC).
FIG. 1 illustrates a well-known array multiplier circuit. The illustrated array multiplier circuit includes an N×N (N by N) array of cells (101, 102, 103, 104) including full adders plus adjacent half adders and AND gates, with a ripple carry adder (112, 113) added at the top of the array to provide the upper N bits of the final sum. In the circuit of FIG. 1, the two N-bit inputs to the multiplier circuit are X[N−1:0] and Y[N−1:0], and the 2N-bit product output of the multiplier circuit is P[2N−1:0]. Each &/FA sub-circuit 102 (see FIG. 2) includes a full adder and a logical AND gate coupled to one of the full adder inputs. The &/FA cell 102 provides the partial product bit SOUT and the carry out signal COUT from the carry input CIN, the two bit inputs YIN and ZIN, and the partial product input bit SIN. Each &/HA sub-circuit 103 (see FIG. 3) includes a half adder and a logical AND gate coupled to one of the half adder inputs. Each &/HA cell 103 provides the partial product bit SOUT and the carry out signal COUT from the carry input CIN, the two bit inputs YIN and ZIN, and the partial product input bit SIN. Each AND sub-circuit 104 (see FIG. 4) includes a logical AND gate driven by the corresponding YIN and ZIN inputs and providing the AND output signal ANDOUT. The N×N array provides the lower N bits of the product P[N−1:0].
The ripple carry adder at the top of the array includes full adder sub-circuits (RCFA 112) and a half adder sub-circuit (RCHA 113), with the ripple carry chain going from right to left as shown in FIG. 1. The ripple carry adder performs the final summation of the partial products and provides the upper N bits of the product P[2N−1:N].
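The array structure described above can be modeled at the bit level. The following Python sketch is illustrative only and does not reproduce the exact wiring of FIG. 1: it assumes unsigned operands and a generic carry-save arrangement in which each row passes its sum bits diagonally and its carry bits straight on to the next row. The names `full_adder` and `array_multiply` are hypothetical and do not appear in the figures.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def array_multiply(x, y, n):
    """Multiply two unsigned n-bit integers with a bit-level array model."""
    xb = [(x >> j) & 1 for j in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    product = [0] * (2 * n)

    # Row 0 needs only AND gates, since there is nothing to add yet.
    sums = [xb[j] & yb[0] for j in range(n)]
    carries = [0] * n
    product[0] = sums[0]

    # Rows 1..n-1: each cell ANDs one bit pair and adds the result to the
    # sum bit from the cell one column up in the previous row (the diagonal
    # path) and the carry bit from the same column of the previous row.
    for i in range(1, n):
        new_sums, new_carries = [], []
        for j in range(n):
            pp = xb[j] & yb[i]
            sin = sums[j + 1] if j + 1 < n else 0
            s, c = full_adder(pp, sin, carries[j])
            new_sums.append(s)
            new_carries.append(c)
        sums, carries = new_sums, new_carries
        product[i] = sums[0]  # one low-order product bit per row

    # Final ripple carry adder: merges the leftover sum and carry bits
    # to produce the upper n bits of the product.
    ripple = 0
    for k in range(n):
        a = sums[k + 1] if k + 1 < n else 0
        s, ripple = full_adder(a, carries[k], ripple)
        product[n + k] = s

    return sum(bit << k for k, bit in enumerate(product))
```

In this model the array rows yield the lower N product bits one row at a time, and the final loop plays the role of the ripple carry adder that yields the upper N bits. In the first adder row every carry input is zero, so those cells behave as half adders.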
Thus, a standard array multiplier circuit can have a rectangular aspect well suited for implementation in an integrated circuit. However, a typical multiplier circuit includes several types of cells and thus is not completely regular in design.
Other multiplier architectures in common use utilize “Wallace trees”. These architectures use carry save adders instead of the long carry chains required by an array multiplier. For sufficiently large values of N, these architectures have improved multiplier performance compared to the structure of FIG. 1, but at the price of having a much less regular structure. Thus, multipliers utilizing Wallace trees and similar methods may be less suited for implementation in array-type integrated circuits, e.g., in many programmable integrated circuits.
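The tree-style reduction mentioned above can be sketched in the same way. The following Python example is a simplified, word-level illustration and not a gate-accurate Wallace tree: it assumes unsigned operands, applies a 3:2 carry save step to groups of three partial-product rows per level, and uses ordinary integer addition in place of the final carry propagate adder. The names `carry_save_add` and `tree_multiply` are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compression: three addends in, a sum word and a carry word out."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def tree_multiply(x, y, n):
    """Return (product, levels) for unsigned n-bit x and y."""
    # One shifted copy of x per set bit of y (the partial-product rows).
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    levels = 0
    while len(rows) > 2:
        nxt = []
        # Compress each full group of three rows into two rows.
        for k in range(0, len(rows) - 2, 3):
            nxt.extend(carry_save_add(rows[k], rows[k + 1], rows[k + 2]))
        # Rows that do not fill a group pass through to the next level.
        nxt.extend(rows[len(rows) - len(rows) % 3:])
        rows = nxt
        levels += 1
    # Stand-in for the final carry propagate adder.
    return sum(rows), levels
```

In this sketch, eight partial-product rows are reduced to two in four levels (8, 6, 4, 3, 2), whereas the array approach adds one row per step. The irregular grouping from level to level is what makes the tree harder to lay out as a regular array.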
Programmable integrated circuits (ICs) are a well-known type of arrayed IC that can be programmed to perform specified logic functions. An exemplary type of programmable IC, the field programmable gate array (FPGA), typically includes an array of programmable tiles. These programmable tiles can include, for example, input/output blocks (IOBs), configurable logic blocks (CLBs), dedicated random access memory blocks (BRAM), multipliers, digital signal processing blocks (DSPs), processors, clock managers, delay lock loops (DLLs), and so forth.
Each programmable tile typically includes both programmable interconnect and programmable logic. The programmable interconnect typically includes a large number of interconnect lines of varying lengths interconnected by programmable interconnect points (PIPs). The programmable logic implements the logic of a user design using programmable elements that can include, for example, function generators, registers, arithmetic logic, and so forth.
The programmable interconnect and programmable logic are typically programmed by loading a stream of configuration data into internal configuration memory cells that define how the programmable elements are configured. The configuration data can be read from memory (e.g., from an external PROM) or written into the FPGA by an external device. The collective states of the individual memory cells then determine the function of the FPGA.
Another type of programmable IC is the Complex Programmable Logic Device, or CPLD. A CPLD includes two or more “function blocks” connected together and to input/output (I/O) resources by an interconnect switch matrix. Each function block of the CPLD includes a two-level AND/OR structure similar to those used in Programmable Logic Arrays (PLAs) and Programmable Array Logic (PAL) devices. In CPLDs, configuration data is typically stored on-chip in non-volatile memory. In some CPLDs, configuration data is stored on-chip in non-volatile memory, then downloaded to volatile memory as part of an initial configuration (programming) sequence.
For all of these programmable ICs, the functionality of the device is controlled by data bits provided to the device for that purpose. The data bits can be stored in volatile memory (e.g., static memory cells, as in FPGAs and some CPLDs), in non-volatile memory (e.g., FLASH memory, as in some CPLDs), or in any other type of memory cell.
Other programmable ICs are programmed by applying a processing layer, such as a metal layer, that programmably interconnects the various elements on the device. These ICs are known as mask programmable ICs. Programmable ICs can also be implemented in other ways, e.g., using fuse or antifuse technology. The terms “programmable integrated circuit” and “programmable IC” include but are not limited to these exemplary devices, as well as encompassing devices that are only partially programmable. For example, one type of programmable IC includes a combination of hard-coded transistor logic and a programmable switch fabric that programmably interconnects the hard-coded transistor logic.
Traditionally, programmable ICs include one or more extensive dedicated clock networks, as well as clock management blocks that provide clock signals for distribution to all portions of the IC via the dedicated clock networks. These clock management blocks can be quite complicated, encompassing, for example, delay lock loops (DLLs), phase locked loops (PLLs), and so forth. For example, the Virtex®-4 series of FPGAs from Xilinx, Inc. includes up to 20 clock management blocks, each providing individual clock deskewing, frequency synthesis, phase shifting, and/or dynamic reconfiguration for a portion of the IC. Thus, a significant amount of design and testing time is required to provide these features in the device, and their use also requires time and effort on the part of the system designer. Additionally, because a global clock signal may be needed at virtually any position in a programmable IC, a global clock network is very extensive and consumes large amounts of power when in use.
A large IC design typically has a large number of timing requirements. For example, a clock signal must reach the destination within a certain window within which the data being provided to the destination is valid. Meeting these timing requirements for every logic block in a large IC can present a significant challenge, particularly when complicated by issues such as multiple clock domains, skew, jitter, and process, voltage, and temperature variability. Thus, the well-known timing requirements known as the “setup time” for data (the amount of time by which the data signal must precede the active edge of the clock signal at the input terminals of the logic block) and the “hold time” for the data (the amount of time the data signal must remain at the data input terminal after the arrival of the active edge of the clock signal) are vital to the success of a clocked design, and must be met for every clocked element, or the logic cannot be expected to operate properly.
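The setup and hold requirements described above can be expressed as two slack calculations for a single register-to-register path. The following Python sketch is illustrative only. It assumes a simple single-clock model with all delays given as integers in picoseconds, and with skew defined as the capture clock arrival time minus the launch clock arrival time. The function and parameter names are hypothetical.

```python
def setup_slack(period, clk_to_q, max_delay, setup, skew=0):
    """Margin by which data arrives before the setup window opens.

    Data launched on one clock edge must settle at the destination at
    least `setup` ps before the next active edge arrives there.
    """
    return (period + skew) - (clk_to_q + max_delay + setup)


def hold_slack(clk_to_q, min_delay, hold, skew=0):
    """Margin by which data stays stable after the capturing edge.

    Newly launched data must not reach the destination until `hold` ps
    after the same active edge arrives there.
    """
    return (clk_to_q + min_delay) - (hold + skew)


def meets_timing(period, clk_to_q, min_delay, max_delay, setup, hold, skew=0):
    """True when both requirements are met for this path."""
    return (setup_slack(period, clk_to_q, max_delay, setup, skew) >= 0
            and hold_slack(clk_to_q, min_delay, hold, skew) >= 0)
```

Under this model, positive skew helps the setup check and hurts the hold check, which is one reason clock skew complicates timing closure across a large device. A negative slack on either check corresponds to the case in which the logic cannot be expected to operate properly.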
Therefore, it is clear that the design of reliable clock networks for a large programmable IC with multiple clock domains may consume a large amount of engineering resources and may adversely impact the design cycle of the programmable IC.
Programmable ICs are typically designed to be useful in a large variety of customer applications. Therefore, they tend to include a large number of substantially similar logic blocks that are designed with flexibility in mind. To improve the efficiency of certain target applications, including compute-intensive applications such as digital signal processing (DSP), specialized blocks may be included as well as the array(s) of highly flexible logic blocks. However, to achieve the optimum mix of flexibility and efficiency, it may be desirable to provide a programmable IC in which the logic blocks are optimized, in themselves, for compute-intensive applications.