This invention relates to control of clock-to-output time in a specialized processing block, particularly in a programmable integrated circuit device such as, e.g., a programmable logic device (PLD).
As applications for which PLDs are used increase in complexity, it has become more common to design PLDs to include specialized processing blocks in addition to blocks of generic programmable logic resources. Such specialized processing blocks may include a concentration of circuitry on a PLD that has been partly or fully hardwired to perform one or more specific tasks, such as a logical or a mathematical operation. A specialized processing block may also contain one or more specialized structures, such as an array of configurable memory elements. Examples of structures that are commonly implemented in such specialized processing blocks include: multipliers, arithmetic logic units (ALUs), barrel-shifters, various memory elements (such as FIFO/LIFO/SIPO/RAM/ROM/CAM blocks and register files), AND/NAND/OR/NOR arrays, etc., or combinations thereof.
One particularly useful type of specialized processing block that has been provided on PLDs is a digital signal processing (DSP) block, which may be used to process, e.g., audio signals. Such blocks are frequently also referred to as multiply-accumulate (“MAC”) blocks, because they include structures to perform multiplication operations, and sums and/or accumulations of multiplication operations.
For example, PLDs sold by Altera Corporation, of San Jose, Calif., as part of the STRATIX® family, include DSP blocks, each of which may include four 18-by-18 multipliers. Each of those DSP blocks also may include adders and registers, as well as programmable connectors (e.g., multiplexers) that allow the various components to be configured in different ways.
In each such block, there are competing timing constraints. Normally, a user would like to be able to operate the block, along with the device of which it is a part, at the highest possible clock speed. However, the speed of each component within the block may limit the amount that can be accomplished within each clock cycle. For example, in the aforementioned DSP block, if the multipliers are designed to operate at a certain speed, but the adders are slower, then any function that requires use of both components can operate only at a maximum clock speed determined by the slower component.
One way to lessen the effects of the slower components is to use the registers in the block to pipeline the function to be performed. Thus, if two successive components cannot complete their operations on a particular datum within the same clock cycle, then by the use of registers between the components, their operations can be divided between clock cycles, so that only one component need complete its operation in any one clock cycle. This is of value in cases where the speed of the bottleneck component is fast compared to the maximum device clock but not fast enough to be combined with another component within a single clock cycle. If the bottleneck component is slow compared to the maximum device clock, then pipelining will not help because the device clock still will have to be slowed down to the speed of the bottleneck component.
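The effect of pipelining on the attainable clock speed can be illustrated by the following minimal sketch. It assumes hypothetical component delays (a 2 ns multiplier and a 3 ns adder) and ignores register setup and clock-to-output overhead; the function name and the figures are illustrative only and are not taken from any actual device.

```python
def max_clock_mhz(stages_ns):
    """Return the maximum clock frequency in MHz.

    stages_ns is a list of pipeline stages; each stage is a list of the
    delays (in ns) of the components that must finish within one cycle.
    The clock period can be no shorter than the slowest stage.
    """
    critical_path_ns = max(sum(stage) for stage in stages_ns)
    return 1000.0 / critical_path_ns


# Hypothetical delays, chosen only for illustration.
MULTIPLIER_NS = 2.0
ADDER_NS = 3.0

# Multiplier and adder in the same clock cycle: the delays add up.
unpipelined_mhz = max_clock_mhz([[MULTIPLIER_NS, ADDER_NS]])

# A register between the two: each cycle holds only one component,
# so the slower component (the adder) alone sets the clock period.
pipelined_mhz = max_clock_mhz([[MULTIPLIER_NS], [ADDER_NS]])

print(f"unpipelined: {unpipelined_mhz:.1f} MHz")
print(f"pipelined:   {pipelined_mhz:.1f} MHz")
```

Under these assumed delays the pipelined arrangement is still limited by the adder, consistent with the observation that the clock can run no faster than the bottleneck component allows.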
In any event, pipelining has its own drawbacks even where it may be helpful. Pipelining introduces latency, in that the final result of the pipelined process is delayed by one clock cycle per added pipeline stage—the device can operate at the desired clock rate, but the number of clock cycles until the result is provided increases. Once the first result is provided, subsequent results will continue to flow, and for some applications, such as playback of recorded media, this may be sufficient. However, for other applications, such as real-time audio or video communications, latency may be unacceptable.
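The latency cost of pipelining can be sketched as follows. The model is a simplifying assumption, not a description of any particular block: a new datum enters on every clock cycle, each registered stage takes exactly one cycle, and the clock period (5 ns here) is a hypothetical figure.

```python
def result_times_ns(num_stages, clock_period_ns, num_results):
    """Return the times (in ns) at which successive results appear.

    Assumes one datum enters per clock cycle and each registered stage
    takes one cycle, so result i is available after (num_stages + i)
    cycles.
    """
    return [(num_stages + i) * clock_period_ns for i in range(num_results)]


CLOCK_PERIOD_NS = 5  # hypothetical clock period

# One registered stage versus three registered stages, same clock.
single_stage = result_times_ns(1, CLOCK_PERIOD_NS, 3)
three_stage = result_times_ns(3, CLOCK_PERIOD_NS, 3)

print(single_stage)  # [5, 10, 15]
print(three_stage)   # [15, 20, 25]
```

In this model each added stage delays the first result by one clock period, while the spacing between subsequent results is unchanged, which is why the added latency may be tolerable for playback but not for real-time communications.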
Alternatives, such as moving slower functions outside the specialized processing block, where they are not subject to the timing constraints within the block, may increase the time-from-clock-to-output (Tco), i.e., the time from the clock on which the block output is completed until the output reaches the first unregistered destination outside the block. Because additional time is needed for the post-block processing, such alternatives sacrifice Tco for clock speed, or “fmax”.