1. Technical Field
This application relates to electronic systems and, more particularly, to clock signal distribution networks within digital electronic systems, and especially to clock distribution within integrated circuit (IC) chips that contain many processing units.
2. Description of the Related Art
The economics of large, expensive computer systems dictate that they be kept busy all the time, and their performance was traditionally measured in computations per second. For small, inexpensive computers, continuous high-speed operation is not required, and is even a hindrance for battery-operated devices. Increasingly, computer and digital signal processor (DSP) performance is measured in computations per second per watt, or equivalently in computations per joule of energy used.
While there are entertainment applications that require high performance operation for hours at a time, most uses of small computers require bursts of high performance for less than a minute. In fact, there are many time intervals when a small embedded computer or DSP may operate just fine at reduced speeds. Since the circuit technologies for microcomputers consume electrical power in proportion to compute speed, opportunities to run at reduced speed are opportunities to reduce power consumption and conserve battery charge. The opportunities may be greatest for personal electronic devices (PEDs), where human interests and attention place highly variable demands on the microcomputers and DSPs embedded therein.
Single Processor Systems
In a computer with only one processing unit, the processor can adjust its own speed by writing to special circuits that generate the system clock signal. This may be used to match the system clock frequency to the average workload. However, a reduced system clock frequency (or rate) also slows the resident kernel of the operating system software and lengthens its response time. Depending on implementation, users may notice pauses when the machine needs to up-shift to a faster clock rate to deliver more computations per second.
Single-processor computers and their control software often also have user-adjustable time-outs, and the more power-down modes in the hardware, the more finely the system can adapt its power use to actual demand for computation. For example, a processor may switch to a reduced-speed, reduced-supply-voltage state after an initial timeout, into a clock-stopped state after a longer timeout, and into a low-voltage sleep state after a yet longer timeout. These low-voltage states maintain data in volatile memory, which is advantageous for quick re-activation. If a processor's power is completely cut off, the data in its volatile memory is lost, and upon re-activation of the processor the data will have to be reloaded from non-volatile memory.
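The staged timeouts described above can be sketched as a simple lookup from idle time to power state. This is a minimal illustration only: the state names and the threshold values are hypothetical and are not taken from any particular processor.

```python
from enum import Enum


class PowerState(Enum):
    FULL_SPEED = "full speed"
    REDUCED = "reduced speed and supply voltage"
    CLOCK_STOPPED = "clock stopped"
    SLEEP = "low-voltage sleep"


# Hypothetical timeouts in seconds, deepest state first (assumed values).
TIMEOUTS = [
    (600.0, PowerState.SLEEP),
    (60.0, PowerState.CLOCK_STOPPED),
    (5.0, PowerState.REDUCED),
]


def select_power_state(idle_seconds: float) -> PowerState:
    """Return the deepest power-down state whose timeout has elapsed."""
    for threshold, state in TIMEOUTS:
        if idle_seconds >= threshold:
            return state
    return PowerState.FULL_SPEED


if __name__ == "__main__":
    for idle in (0.0, 10.0, 120.0, 3600.0):
        print(idle, select_power_state(idle).value)
```

All three power-down states in this sketch retain volatile memory, so returning to full speed does not require a reload from non-volatile memory.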
Multi-Processor Systems
Large multiprocessor systems have pioneered many techniques to improve computations per second but have been less aggressive with power management. With the advent of PEDs using inexpensive IC chips containing multiple processing units, the demand for energy efficiency has increased a great deal.
Advantages of multiprocessing include much higher computational throughput for algorithms converted for parallel execution, and increased reliability and security due to separation of processes onto different processors and memories. In a multiprocessor system it is much less likely that a supervisory process executing on its own processor will be delayed by an application process executing on other processors.
Within applications, some processors may be slowed and others accelerated depending on external events. For example, the performance required of a video processor for display of video data may depend on the type of data and on user activity. (In this example a video processor may be a single unit specialized for video, or it may be a group of processing elements programmed to process video in a parallel way.) If a user is editing video there may be frequent pauses in the display of motion. While paused, the video processor may be lowered to idle speed, ready to respond but dissipating less power than at full speed. Meanwhile the user interface may be handled by a different processor optimized for user interaction.
Another way to conserve power in a multi-processor system is to arrange for multiple processors to run on a variety of clock frequencies—fast clocks for critical paths in a computation and slower clocks for other parts. Since the opportunities to save power are highly dependent on application software, the clock distribution hardware should be configurable, preferably configurable rapidly from application software.
Multi-Processor Arrays
Increasingly, digital electronic systems, such as computers and digital signal processors (DSP), utilize one or more multi-processor arrays (MPAs). An MPA may be loosely defined as a plurality of processing elements (PEs), supporting memory (SM), and a high bandwidth interconnection network (IN). As used herein, the term “processing element” refers to a processor or CPU (central processing unit), microprocessor, or a processor core. The word “array” in MPA is used in its broadest sense to mean a plurality of computational units (each of which may contain processing and/or memory resources) interconnected by a network with connections available in one, two, three, or more dimensions, including circular dimensions (loops or rings). Note that a higher dimensioned MPA can be mapped onto fabrication media with fewer dimensions, provided that the media supports the increased wiring density. For example, an MPA with the shape of a four dimensional (4D) hypercube can be mapped onto a 3D stack of silicon integrated circuit (IC) chips, or onto a single 2D chip, or even a 1D line of computational units. Also low dimensional MPAs can be mapped to higher dimensional media. For example, a 1D line of computation units can be laid out in a serpentine shape onto the 2D plane of an IC chip, or coiled into a 3D stack of chips. An MPA may contain multiple types of computational units and interspersed arrangements of processors and memory. Also included in the broad sense of an MPA is a hierarchy or nested arrangement of MPAs, especially an MPA composed of interconnected IC chips where the IC chips contain one or more MPAs which may also have deeper hierarchal structure.
There may be one or more interconnection networks (INs) in an MPA or between MPAs of differing type. The purpose of interconnection networks in MPAs is to move data, instructions, status, configuration, or control information between and among PE, SM, and I/O. The primary interconnection network (PIN) is designed for high bandwidth data movement, with good but not extremely low latency (the time delay for the delivery of data between source and destination). The data moved by the PIN may encapsulate other types of information provided there is hardware or software at the data destination that is able to translate the data to the other types of information. An MPA may have other, secondary INs; these may exhibit lower or higher latency but generally will have much lower bandwidth.
An IN is composed of links and nodes. A link is typically composed of a set of parallel “wires” implemented as electrically conductive paths (tracks or traces) on a circuit board or an IC. A node contains ports for coupling to the links, which contain the transmitter and receiver circuits to send and receive signals on the links. A node may have other ports for communications with PE or SM. A node has a Router which contains data paths and switches for connecting ports to each other, plus a router control mechanism for selectively connecting ports according to one or more protocols.
To achieve high bandwidth, each link of the PIN may include many parallel wires. If the distance between nodes is small, the links are short and a standard CMOS binary signaling scheme may be used. In this scheme, a steady signal voltage near the high side of the power supply is a signal state (H) that represents a logical 1, and a steady signal voltage near the low (or ground) side of the power supply is the other binary state (L) and represents a logical 0, so that one wire encodes one bit of information. If a link is long, such as between IC chips or between circuit boards, then different signaling schemes may be better suited to maintain high speed and reject noise.
The parallel wires in a link may carry data or clock signals. The purpose of a clock signal is to mark points in time where transmit circuits may change data signals and where receive circuits may sample data signals. In a properly designed circuit the sampling time occurs after a changed data signal settles to a steady-state value. A transmitter may use a clock signal to trigger when it drives a line to signal state H or L; a receiver circuit may use a clock signal to latch the data signals into a register. A common convention is that a receiver latches data on the rising (0 to 1) transition of its clock signal, while a transmitter updates its outputs at the falling (1 to 0) transition of its clock signal. These signal state transitions take a finite amount of time to complete but if the rise and fall intervals are short compared to the interval used to represent a bit, the transitions may also be referred to as “edges”.
If a clock signal is shared amongst multiple transmitters and receivers, then they are said to be synchronized and the data transfer is generally referred to as “synchronous” data transfer. “Asynchronous” data transfer is simply any scheme where data signals may be transmitted and received without the use of a common clock signal. An asynchronous receiver is more flexible for sampling data signals than a synchronous receiver. In particular, it may sample and latch data at timepoints that are quite different from its local clock signal. Some asynchronous receivers work by oversampling the input to look for data signal transitions. Simpler asynchronous receivers accept a clock (or strobe) input signal that originates with the transmitter and is carried along with data; the strobe input latches the data at the front end of the receiver and it is then buffered and retimed for synchronous outputs.
Data flow on a link may need to be interrupted by either the transmitter or receiver. If the transmitter temporarily has no new data to send, the receiver may erroneously keep reading the last bit of data unless it gets a not-ready signal from the transmitter. Similarly, if the receiver temporarily has no place to put data, it may erroneously ignore arriving data unless it can tell the transmitter to stop sending. Interconnection networks may have special signals devoted to flow control and protocols for what nodes are supposed to do when these signals change state. These special signals may be wires in the link itself or they may be code patterns in the set of wires. Protocols are implemented with simple state machines.
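The flow control idea above can be sketched as a cycle-by-cycle simulation of one link, in which a word is transferred only on cycles where the transmitter has data and the receiver has room. This is a generic valid/ready style handshake chosen for illustration; the function name, the signal names, and the unbounded transmit queue are assumptions and do not describe any specific interconnection network.

```python
from collections import deque


def run_link(tx_schedule, rx_ready_schedule):
    """Simulate one link, one list entry per clock cycle.

    tx_schedule: new word offered by the transmitter each cycle, or None
        when it has nothing new to send (the 'not-ready' case).
    rx_ready_schedule: True when the receiver has room that cycle, False
        when it must tell the transmitter to stop sending.
    Returns the words the receiver latched, in order.
    """
    pending = deque()   # words waiting at the transmitter (assumed unbounded)
    received = []
    cycles = max(len(tx_schedule), len(rx_ready_schedule))
    for cycle in range(cycles):
        if cycle < len(tx_schedule) and tx_schedule[cycle] is not None:
            pending.append(tx_schedule[cycle])
        valid = len(pending) > 0
        ready = cycle < len(rx_ready_schedule) and rx_ready_schedule[cycle]
        if valid and ready:
            # Transfer happens only when both sides agree.
            received.append(pending.popleft())
    return received


if __name__ == "__main__":
    words = ["a", None, "b", "c", None, None]
    room = [True, True, False, True, True, True]
    print(run_link(words, room))
```

In the usage example, the idle cycle after "a" does not cause "a" to be read twice, and the cycle in which the receiver has no room does not cause "b" to be lost.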
In a typical microprocessor IC chip the data transfers are synchronous. However, the pursuit of higher performance (computations per second) has pushed clock frequencies higher and higher (currently around 2 GHz). Clock frequencies this high are reasonable inside an IC where wires are physically short, but are difficult to manage for the chip I/O and inter-chip links.
Signals propagate on circuit boards at very high speeds (on the order of 4-6 inches per nanosecond), but for fine wire “traces” on a circuit board, a transmitter can develop rise and fall times shorter than a nanosecond. With fast enough rise and fall times, several clock/data transitions (or edges) may be in transit on the signal wires between IC chips at any given moment.
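As a rough check on the claim that several edges may be in transit at once, the number of edges on a trace can be estimated from its flight time and the toggle rate of the signal. The sketch below assumes an ideal trace with a single propagation velocity; the function name and the example trace length are hypothetical.

```python
def edges_in_flight(trace_length_in, velocity_in_per_ns, toggle_rate_ghz):
    """Estimate how many signal edges occupy a trace at one instant.

    A signal toggling at f GHz produces 2*f edges per nanosecond
    (one rising and one falling edge per cycle).
    """
    flight_time_ns = trace_length_in / velocity_in_per_ns
    return flight_time_ns * 2.0 * toggle_rate_ghz


if __name__ == "__main__":
    # Assumed example: 12 inch trace, 6 inches per nanosecond, 1 GHz clock.
    print(edges_in_flight(12.0, 6.0, 1.0))
```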
On almost any microprocessor IC chip, the clock frequencies used with the chip I/O circuits are not as high as the clock frequencies used in the core. High bandwidth, on the order of 4 giga-words (16 bits per word) per second, between nearby chips on a circuit board may be obtained with parallel wires, low-voltage differential signaling (LVDS), and synchronous data transfer. Between circuit boards, high bandwidth may be obtained with parallel wires or optical fibers and synchronous or asynchronous data transfers. Specialized circuits and controllers are used with external memory chips, such as the popular double data rate (DDR) series of interfaces. Specialized circuits are also used for high speed bit-serial communication, such as serializer and deserializer (SERDES) circuits.
To build large systems composed of multiple VLSI chips and synchronous parallel inter-chip communication, IO clocks are preferably generated in such a way that they will be synchronized across multiple IC chips. Typically this is achieved with a phase-locked loop (PLL) in each chip. The PLL maintains a constant averaged phase relation between a clock reference signal generated externally and the clock signals inside the chip. Typically the reference clock frequency is much lower than the internal clock frequencies in order to limit bandwidth and noise introduced into the reference clock signal, and/or to use the output of crystal controlled oscillators.
Multi-Frequency Clocks
The PE, SM, IN, and clock distribution network for an MPA need to be more power efficient per processor than for conventional microprocessors, simply because there are 10 to 100 times more processors in each MPA IC chip, and a reasonable chip size and package for it have a limited capacity to dissipate heat.
MPA clock distribution and its control mechanisms also should be more flexible because with larger numbers of processors there is greater fluctuation in the instantaneous demand for their operation.
In multi-processor systems, processors can be configured to control the supply voltage and clocking frequency of other processors for the purpose of conserving overall power dissipation. A simple approach is to turn off the clock to processors that are temporarily not needed and for longer intervals to turn off their power. A more sophisticated approach involves preparing processors at low speeds for use at high speeds.
For a processor and memory, turning power back on and resuming processing is much more complicated than turning it off. When power comes up the processor is in a random state that requires a reset of the circuits followed by clock turn on. Then an initialization sequence is required to bring the processor to a known ready state, reload support memory, and begin execution of application software.
If all of this takes too long for the application, then it may be useful to prepare a processor at a low clock frequency (conserving power), so that it may resume full speed operation with only a few microseconds of advance notice.
Power Consumption
To see how energy can be conserved with parallel computing, we briefly review the ways that digital CMOS circuits use power. Basically the average power use depends on supply voltage and clocking frequency.
In CMOS digital circuits logical ones and zeroes are represented by high and low voltage levels on signal lines. The state of a signal line is high or low. Power is used to change (or switch) the state of each signal, otherwise the circuit sits in a quiescent state that dissipates a much smaller amount of power, which is due only to leakage currents. The energy required to switch a signal line from high to low or low to high is mostly proportional to the total electrical capacitance, C, of the line and the transistors connected to it. The power supply current required by a transistor to switch a signal line at first surges and then decays—much like the current through a switch to charge a capacitor. The integrated current through the transistor for the switching event (in amp*seconds) is equal to the change in the charge, Q, on the total capacitance, C. From the physics of capacitors, Q=C*V where C is in farads and V in volts. Repeated charging and discharging at some frequency f results in an average switching power of:
Pavg = I*V = f*C*V*V = f*C*V^2
This linear relation of power consumption to frequency holds over a wide range, many orders of magnitude. At very low frequencies there is a power floor where the dc leakage currents will dominate the overall power consumption. At very high frequencies the transistors are not fast enough to completely switch the signal lines, and this causes bit errors and excess supply current. Often the bit errors can be suppressed by increasing the V of the supply but this causes a quadratic increase in power until the circuits are damaged by overheating.
If a CMOS circuit does not need to run fast, then Pavg can be reduced by operation at lower frequency, and further reduced by reducing the supply voltage. However, operation at lower voltages results in less charge/discharge current per transistor. Below a threshold voltage, Vth, the transistors are off (except for tiny sub-threshold currents).
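The switching power relation above can be evaluated directly to see the linear dependence on frequency and the quadratic dependence on supply voltage. The capacitance, frequency, and voltage values below are hypothetical and chosen only to make the ratios easy to read; leakage power is not modeled.

```python
def dynamic_power_watts(freq_hz, capacitance_farads, supply_volts):
    """Average switching power Pavg = f * C * V^2, in watts."""
    return freq_hz * capacitance_farads * supply_volts ** 2


if __name__ == "__main__":
    # Assumed operating point: 1 GHz, 1 nF total switched capacitance, 1 V.
    base = dynamic_power_watts(1e9, 1e-9, 1.0)
    half_freq = dynamic_power_watts(0.5e9, 1e-9, 1.0)
    half_volt = dynamic_power_watts(1e9, 1e-9, 0.5)
    print("half frequency ratio:", half_freq / base)
    print("half voltage ratio:", half_volt / base)
```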
Energy Saving Opportunity for Parallel Computing
The opportunity for parallel computing is that the energy used per computation can be lower than with a unitary processor. To see how this is so, consider a computation that requires 1 billion operations. On a unit processor at 1 GHz this may take about 1 s at a power of, say, 100 W (averaging 50 A at 2 V), or about 100 joules of energy. If 100 processors of the same type and power supply are used, the computation time may be reduced, ideally by the number of processors, but due to communication overhead, a reduction of 50× to 20 ms is more likely. The energy required has doubled, to about 200 joules, because there are 100 times as many processors running for 1/50 of the time interval. However, we can slow the processors down by 50× to 20 MHz and complete the fixed computation in the original 1 s interval. This reduces the power dissipation per processor to 2 W.
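The arithmetic of this example can be reproduced with a few lines of code. The sketch uses the figures stated in the text (100 W for the single processor, 100 processors, a 50× speedup) and assumes, as the text does, that power per processor scales in proportion to clock frequency at a fixed supply voltage; the function name is hypothetical.

```python
def energy_joules(n_processors, power_per_processor_w, run_time_s):
    """Total energy = processors * power per processor * run time."""
    return n_processors * power_per_processor_w * run_time_s


if __name__ == "__main__":
    single = energy_joules(1, 100.0, 1.0)
    # 100 processors at full speed finish in 1/50 of the time.
    parallel_fast = energy_joules(100, 100.0, 1.0 / 50)
    # Slowing each processor 50x scales its power by 1/50 and restores
    # the original 1 s run time (supply voltage not yet reduced).
    slowed_power_w = 100.0 / 50
    parallel_slow = energy_joules(100, slowed_power_w, 1.0)
    print(single, parallel_fast, slowed_power_w, parallel_slow)
```

At this point the slowed parallel system still uses about twice the energy of the single processor; the saving comes only once the supply voltage is lowered as well.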
But now the supply voltage can be reduced because the transistors do not need to charge and discharge the capacitances so quickly.
Actual IC chips may have minimum supply voltage specifications that are closer to about half of the maximum supply voltage specification, often due to internal circuits designed for high speed.
Generalizing: With N times as many processors at work on a large computation, and the same amount of time to complete it, the clocking frequency, F, can be reduced to a conservative estimate of about 2/N of its single-processor value, and then Vsupply can be reduced by about a factor of two for 10<N<100. The average dynamic power per processor is reduced by the factor (Fp/Fs)*(Vp/Vs)^2, where the p subscripts refer to the parallel computation and the s subscripts refer to single processor computation. So, for the N processors the typical dynamic power reduction compared to a single fast processor is:
Pp/Ps = N*(2/N)*(½)^2 = ½
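The ratio above can be evaluated for several values of N to confirm that N cancels under the stated assumptions. In this sketch the default frequency factor of 2/N and voltage factor of one half come from the text; the function and parameter names are hypothetical.

```python
def parallel_power_ratio(n, freq_factor=None, voltage_factor=0.5):
    """Return Pp/Ps = N * (Fp/Fs) * (Vp/Vs)^2 for N parallel processors.

    freq_factor defaults to the conservative estimate 2/N from the text.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if freq_factor is None:
        freq_factor = 2.0 / n
    return n * freq_factor * voltage_factor ** 2


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, parallel_power_ratio(n))
```

Passing a different `voltage_factor` shows how sensitive the result is to the achievable supply reduction, since that term enters as a square.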
The static power consumption due to leakage currents may also be reduced by lower supply voltages.
This strategy has its physical limits, of course. With Vdd a few tenths of a volt above transistor turn-on voltage (Vth) the statistical scatter of Vth becomes a limiting factor. Future improvements in fabrication technology may reduce the scatter of Vth.
In an MPA, additional power savings can be made in the clock distribution network itself if the requirements on clock skew between distant parts of the array can be relaxed. This is possible in MPAs where most signal paths are short, connecting only to nearby circuit blocks. For example, the HyperX architecture (ref U.S. Pat. No. 7,415,594) has this property that a very high percentage of the signal paths are short in length.
Exemplary Multiprocessor IC
FIG. 1 illustrates an embodiment of a multiprocessor IC for the purpose of illuminating clock distribution network design issues/problems addressed by an embodiment of this application. As illustrated in FIG. 1, the exemplary hx3100A multiprocessor IC comprises an MPA, which receives as inputs a clock signal CLK1 and a synchronizing signal SYNC. The CLK1 and SYNC signals are generated by a CLK1+SYNC Generator. The CLK1+SYNC Generator receives as inputs a clock reference signal CLKREF, a clock bypass signal Bypass, and a system synchronization signal SYNCIN. Other inputs and other components present on the hx3100A multiprocessor IC are not illustrated. Clock reference signal CLKREF is a system reference clock that may be used to synchronize operations between different chips, and is illustrated in FIG. 1 as being generated by an oscillator OSC1. Components in this and other figures are not shown to scale.
The MPA of the hx3100A multiprocessor IC has a 10×10 array of PE that are interspersed in an 11×11 mesh of nodes of an interconnection network (IN). Each IN node contains shared data memory (DM) to support the neighboring four PE; and each PE may access shared DM in the four neighboring nodes surrounding it. Each PE has private instruction memory (IM).
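The interspersed arrangement of a 10×10 PE array in an 11×11 mesh of nodes can be sketched with a coordinate mapping. The indexing convention below (PE (r, c) sits between nodes (r, c) and (r+1, c+1)) is an assumption made for illustration and is not taken from the chip documentation; under it, nodes on the perimeter of the mesh have fewer than four neighboring PEs.

```python
PE_ROWS = PE_COLS = 10       # 10 x 10 array of processing elements
NODE_ROWS = NODE_COLS = 11   # 11 x 11 mesh of interconnection network nodes


def nodes_for_pe(r, c):
    """Return the four nodes whose shared data memory PE (r, c) may access."""
    if not (0 <= r < PE_ROWS and 0 <= c < PE_COLS):
        raise ValueError("PE coordinates out of range")
    return [(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)]


def pes_for_node(i, j):
    """Return the PEs (up to four) supported by node (i, j)."""
    if not (0 <= i < NODE_ROWS and 0 <= j < NODE_COLS):
        raise ValueError("node coordinates out of range")
    candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1), (i, j)]
    return [(r, c) for r, c in candidates
            if 0 <= r < PE_ROWS and 0 <= c < PE_COLS]


if __name__ == "__main__":
    print(nodes_for_pe(0, 0))
    print(pes_for_node(5, 5))
    print(pes_for_node(0, 0))
```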
The chip is divided into four quadrants for internal dc power supply distribution; the positive side of the power distribution network is divided into four “voltage islands” that may be separately coupled to external power supplies. The negative side of the distribution network is coupled to system zero reference “ground.”
The circuits crossing the boundaries between quadrants may be designed simply to operate with adjacent voltage islands at the same voltage and to self-protect when one voltage island is switched off. The circuits crossing the boundary may be made further capable of operation with adjacent voltage islands at different non-zero voltages with the addition of level-shifting circuits. Level shifting circuits are well known in the industry, and easily added, but they may introduce additional power dissipation and signal delay.
The clock distribution network for the hx3100A chip supports moderately large (16×) frequency differences between the processors and their supporting memory (SM) elements and interconnection network (IN) while maintaining an overall synchronous array. All processor memory accesses and data transfers in the core array occur in step with a global clock signal.
The hx3100 has a clock tree architecture with distributed regenerators. It distributes a clock signal to every part of the chip with relatively low power dissipation while limiting clock skew between PE and local nodes. An H tree was also considered, but it would have had more regenerators than the tree chosen, and thus would dissipate more power. The disadvantage of this tree compared to the H tree is that the central area has a clock signal that is skewed (phase advanced) in steps with respect to the perimeter of the chip. However, the multiprocessor architecture for which it is designed has mostly short links and connections to nearest neighbors, and thus good tolerance of the skew between steps.
FIG. 2 shows that the chip is divided into a checkerboard of macrocells, each served by a regenerator output, and having a uniform clock signal phase and internally synchronous operation.
In the concept of concentric window-frame time zones, centrally located zones may tap off the clock network closer to its root. The overall effect is that fewer regenerators are needed vs. the H-tree. The circles in the diagram represent regenerators. Each regenerator has one or more outputs to drive other regenerators and/or macrocells (checkerboard squares). Each output to a macrocell has a configurable divide and delay cell (not shown in the figure). The global clock signal CLK1 and synchronization signal SYNC are generated at the edge of the chip by the CLK1+SYNC Generator, and are communicated to the central clock regenerator.
The central clock regenerator distributes clock and sync in four directions to each of the four quadrants of the chip and to additional regenerators in each quadrant. Additional branches are added as the tree extends toward the perimeter of the chip. Except for the central clock regenerator the regenerator cells have outputs for local macrocells. The tree builds up a series of time zones shaped approximately like concentric window frames—though each frame need not have exactly rectangular boundaries or make a complete loop.
On the hx3100A chip, a macrocell may be composed of one PE and one IN node, the IN node containing a DM and a Router and also referred to as a data-memory router (DMR). On other types of chips a macrocell may contain different numbers of these elements.
The hx3100A clock distribution network provides a selection of clock frequencies for each PE while maintaining a uniform high frequency for the DMRs. Individual PEs may be configured to operate at reduced clock frequency using clock dividers located in the regenerators.
Power-of-two fractional frequencies (1/(2^N)) are easily generated with a binary counter of length N bits as illustrated in FIG. 3. The hx3100A chip regenerators use a 4 bit counter and an output selector so that fractions of ½, ¼, ⅛, and 1/16 are supported. If the counter is excessively long (to cover a wider range of frequencies) it begins to take up excess silicon real-estate and adds to leakage power dissipation.
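A behavioral model makes the counter-based divider concrete: bit k-1 of a free-running binary counter toggles with a period of 2^k input cycles, so selecting that bit yields a clock at 1/(2^k) of the input frequency. This is a generic sketch and not a description of the circuit in FIG. 3; the function and parameter names are hypothetical.

```python
def divided_clock(n_cycles, divide_log2, counter_bits=4):
    """Return the divided-clock level for each input clock cycle.

    divide_log2 = k selects counter bit k-1, giving a division by 2**k.
    With a 4 bit counter, k may be 1, 2, 3, or 4.
    """
    if not 1 <= divide_log2 <= counter_bits:
        raise ValueError("divide_log2 must be between 1 and counter_bits")
    mask = (1 << counter_bits) - 1
    count = 0
    levels = []
    for _ in range(n_cycles):
        levels.append((count >> (divide_log2 - 1)) & 1)
        count = (count + 1) & mask   # counter wraps after 2**counter_bits
    return levels


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(k, divided_clock(16, k))
```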
The SYNC signal is not a clock but a pulse one CLK1 period wide that is broadcast with CLK1 on every 16th cycle of CLK1 and it is used to synchronize the PE clock dividers in the regenerators, as shown in the waveforms of FIG. 4. As shown in FIG. 3, SYNC is used to reset the counters every 16 cycles. Without the SYNC signal each divider may have started counting at a different time and therefore the different counters may be out of phase with each other in increments of CLK1 cycles. RegP is the configuration register for the regenerator, and it is accessible by application software. Updates of RegP outputs are aligned to the SYNC signal.
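The effect of SYNC on divider phase can be illustrated by simulating two dividers that power up with different counter values. The model below assumes a synchronous reset: SYNC is asserted on every 16th input cycle and the counter is cleared at the following clock edge. The timing details, names, and initial counts are assumptions for illustration, not the behavior shown in FIG. 3 or FIG. 4.

```python
def run_divider(initial_count, n_cycles, divide_log2, use_sync,
                sync_period=16, counter_bits=4):
    """Return divided-clock levels for a counter starting at initial_count."""
    mask = (1 << counter_bits) - 1
    count = initial_count & mask
    levels = []
    for cycle in range(n_cycles):
        levels.append((count >> (divide_log2 - 1)) & 1)
        sync = use_sync and (cycle % sync_period == sync_period - 1)
        # Synchronous reset on SYNC, otherwise count up and wrap.
        count = 0 if sync else (count + 1) & mask
    return levels


if __name__ == "__main__":
    free_a = run_divider(0, 48, 2, use_sync=False)
    free_b = run_divider(3, 48, 2, use_sync=False)
    sync_a = run_divider(0, 48, 2, use_sync=True)
    sync_b = run_divider(3, 48, 2, use_sync=True)
    print("free-running aligned after cycle 16:", free_a[16:] == free_b[16:])
    print("with SYNC aligned after cycle 16:", sync_a[16:] == sync_b[16:])
```

In this model the free-running dividers keep their initial phase offset indefinitely, while the SYNC-reset dividers agree from the first SYNC pulse onward.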
Data and address buffers are located between PEs and DMRs and between DMRs to hold data during stall intervals. While originally used to control the flow of data, the same mechanism aids the interface of slowed PEs to full speed DMRs.
DMRs are not run slow so as to maintain the bandwidth of the interconnection network; but they can be suspended (clock input halted). Normally the DMR power dissipation also varies with request rate, and if neighboring slowed PEs are making requests at a slower rate, the DMR power dissipation will also decrease.
FIG. 5 shows a way to generate the global CLK1 and SYNC signals that are used on the hx3100A. The PLL is configured by chip inputs. When the PLL is activated it will, after many cycles, phase lock to the average frequency and phase of chip input CLKREF, a square wave. The output of the PLL is shown as the highest frequency clock (HFC), also a square wave, and it may have a frequency that is typically 8 to 128 times higher than CLKREF depending on configuration.
Multiplexer M1, configured by software-accessible Reg0 through Logic1, selects either the HFC or the CLKREF input and outputs the CLK0 signal, which is coupled to clock divider DIV1. Clock divider DIV1 is configured through Logic1 to produce CLK1 at the same or a reduced frequency; CLK1 is the highest frequency clock signal sent into the core array. A counter, CNT0, and a logic gate, NOR1, may be used to generate the SYNC signal.
The counter CNT0 may be periodically reset by the chip input signal SYNCIN. In a multichip system, one hx3100A may be selected to have a master CNT0, and the other hx3100A chips may be slaved to it by receiving a SYNCIN signal from the SYNCOUT signal generated by the master CNT0. However, at high clock rates it is difficult to align the phase of the inter-chip sync signals to properly reset CNT0, which is running on a clock phase locked to CLKREF. Also, any DIV1 I/O frequency ratio other than unity results in possible phase offsets between the internal SYNC signals of the chips of multiples of the HFC cycle.
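The frequency relationships in this clock generator can be summarized in a short calculation. The sketch assumes that the PLL multiplies CLKREF by an integer, that M1 selects either that output or CLKREF directly, and that DIV1 divides by an integer ratio of 1 or more; the function names, parameter names, and example values are hypothetical.

```python
def clk1_frequency_hz(clkref_hz, pll_multiplier, bypass, div1_ratio):
    """Return the CLK1 frequency for a given generator configuration.

    bypass=True models M1 selecting CLKREF instead of the PLL output (HFC).
    """
    if div1_ratio < 1:
        raise ValueError("div1_ratio must be 1 or greater")
    clk0_hz = clkref_hz if bypass else clkref_hz * pll_multiplier
    return clk0_hz / div1_ratio


def sync_rate_hz(clk1_hz, sync_period_cycles=16):
    """SYNC pulses once every sync_period_cycles cycles of CLK1."""
    return clk1_hz / sync_period_cycles


if __name__ == "__main__":
    # Assumed example: 25 MHz reference, PLL multiplier of 40, DIV1 ratio 1.
    clk1 = clk1_frequency_hz(25e6, 40, bypass=False, div1_ratio=1)
    print(clk1, sync_rate_hz(clk1))
```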
In most multichip systems, the interconnections between chips are operated for data transfers at lower rates than the on-chip interconnections are operated. This is done both for signal integrity and power dissipation reasons. If the CLK1 on both chips is adjusted down to a rate that the interchip connections can support without distortions, then reliable synchronous communication between the chips can commence. However, this limits the speed of the PEs and DMRs in the core of the chip and thus the range of applications. Thus there is a need to slow the clocks of the I/O cells relative to CLK1. Benefits of slowed I/O cells are that for slowdown ratios less than about 1000, their power dissipation comes down almost proportionate to the slowdown ratio, and the timing margins improve as the data pulse widths increase.
In the hx3100A chip, an I/O cell receives a clock signal from the last regenerator in a clock distribution branch and from an output that would have gone to a PE had one been located in the I/O cell location. The regenerator contains a clock divider that takes CLK1 and SYNC inputs. Thus an I/O cell clock rate may be configured in the same way as a PE clock rate, and be configured to a clock rate slower than CLK1, as desired for interchip connections. Internal to the chip, an I/O cell clocked this way maintains synchronous communication with the nearest DMR and through the on-chip network (IN) to the rest of the DMRs and PEs inside the chip. Flow control between the I/O cell and the DMR prevents data loss or duplication; however, a data jam may result if a slowed I/O cell is sent data at a higher rate than it can process.
While the input of a shared clock reference signal (CLKREF) to the PLLs of the two chips provides CLK1 phase stability and phase stability between the SYNC signals of the two chips, the CNT0 counters in the sync generators of both chips would have to come out of reset on exactly the same cycle of CLK1 for the SYNC signals of the two chips to be exactly aligned. If one reset signal is delayed (or “skewed”) relative to the other by as little as a half cycle of CLK1, then the two CNT0 counters may lock in a full CLK1 cycle of skew between the SYNC signals, which erodes timing margins for signals between the chips. In general, a skew of the reset signals by an interval t will result in a skew of n cycles of CLK1 in the SYNC signals, where n is t/tper rounded to the nearest integer and tper is the period of CLK1. Therefore, a new approach is desired.
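The relation between reset skew and SYNC skew can be written as a one-line calculation. The sketch rounds exact half cycles upward, matching the statement that a half cycle of reset skew may lock in a full cycle of SYNC skew; that tie-breaking rule and the function name are assumptions.

```python
import math


def sync_skew_cycles(reset_skew, clk1_period):
    """Return n = t / tper rounded to the nearest integer (halves round up).

    reset_skew and clk1_period must be given in the same time units.
    """
    if clk1_period <= 0:
        raise ValueError("clk1_period must be positive")
    return math.floor(reset_skew / clk1_period + 0.5)


if __name__ == "__main__":
    # Assumed example: CLK1 period of 1000 ps, reset skews given in ps.
    for skew_ps in (200, 500, 1400, 2600):
        print(skew_ps, sync_skew_cycles(skew_ps, 1000))
```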