System clock distribution in contemporary computer systems faces difficult problems as system size and the integration density of integrated circuits (ICs) increase, as parallel multiprocessor computer architectures are employed, and as the minimum feature size in IC fabrication decreases. A synchronous system clocking scheme is known to be most advantageous for ease of design and for performance improvement of computer systems. However, it has been impossible to distribute a global system clock over an entire computer system or subsystem when fast clock signals (e.g., over 200 MHz) must be distributed over long interconnection distances with a large fanout (e.g., over sixteen) and with small clock skew and rise time. Multichip Modules (MCMs) are well recognized as one way of solving signal interconnection problems: multiple bare dies are placed directly into a module to eliminate packaging overhead and reduce interconnection distances.
The system clock frequency of a computer system represents the rate of data processing for the CPU or the rate of data transmission for I/O and memory. Thus increasing the system clock frequency directly enhances the throughput of the computer system. For this reason there have been extensive research efforts to achieve faster system clock frequencies using (1) faster logic families (i.e., CMOS < ECL < GaAs), (2) faster storage elements such as latches or flip-flops, (3) robust clocking schemes, and (4) equidistant clock distribution to minimize clock skew.
As faster silicon (Si) and gallium arsenide (GaAs) integrated circuits (ICs) are employed, along with parallel and interconnection-intensive processing architectures, signal interconnections between integrated circuit components become a performance-limiting factor of modern high-speed microprocessors. The signal interconnection problems become worse as computer systems evolve toward higher capacity multi-processing architectures such as high performance distributed multiprocessor computer systems (DMCS). As DMCS size grows, relative signal interconnection delays of DMCS become so large that signal interconnection becomes a major limitation of faster system throughput. This difficult signal interconnection issue of large DMCS makes "scalar" improvement of the computer system throughput nearly impossible as the system size grows with multiple processors.
Several methods have been recognized to resolve the signal interconnection bottlenecks of DMCS, such as efficient packaging, introduction of innovative technologies (optical or superconductor interconnections) into interconnection networks, and efficient system architectures with emphasis on reducing signal interconnection delays. Efficient packaging technologies such as multichip packaging (multichip modules or wafer scale integration) have been studied and implemented to reduce the electrical parasitic characteristics associated with individual chip packaging.
In MCMs, bare dies of ICs are directly mounted on a common module to eliminate one level of the packaging hierarchy and thereby make the overall computer more compact. The inter-chip interconnection distances and chip packaging parasitics are reduced, allowing better signal interconnection performance in multichip packaging. Innovative technologies such as optical interconnection and superconductivity wiring have the potential of greatly reducing the role of signal interconnection as a limiting factor in digital system speed in the future. The network bandwidth of these technologies is wide enough to meet computer system signal interconnection requirements well into the future. In particular, optical interconnection methods can be a very feasible solution to the increased demand for computer speed once successful and reliable device fabrication and integration are realized.
A key signal interconnection required in synchronous digital systems is clock distribution. Clock signal behavior is more restrictive than general I/O signals due not only to high fanout, but also to the fast risetime and small skew required. System clocking provides the fundamental timing for the computer system. The clock frequency determines the rate of data processing for the CPU, access speeds for memory and rate of data transmission in I/O of subsystems. The computer system throughput is proportional to instruction and data transmission rates and these rates are linearly related to the clock frequency. Thus the computer system throughput becomes scalable as the system clock frequency increases.
Clock skew is formally defined as the time difference between the switching points on the rising edge of the clock waveform at different fanout nodes of the clock distribution network. This is normally caused by different interconnection distances and different electrical wire characteristics such as wire widths and thicknesses, materials or cross coupling noise with neighboring wires. The control of clock skew over the physical dimensions of a modern, large and high speed computer system is very difficult because high performance computer systems tend to have higher clock frequencies, larger chip sizes with long signal paths, larger RC delays with smaller feature sizes and unpredictable layout configurations for different ICs. An example of a low skew clock is one in which the skew is less than ten percent (10%) of the clock period.
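As a minimal sketch of the definition above, skew can be computed as the spread between the earliest and latest rising-edge arrival times at the fanout nodes, and then compared against ten percent of the clock period. The function names and the arrival times below are hypothetical and chosen only for illustration.

```python
def clock_skew_ns(arrival_times_ns):
    """Skew: spread between earliest and latest rising-edge arrival (ns)."""
    return max(arrival_times_ns) - min(arrival_times_ns)


def is_low_skew(arrival_times_ns, clock_freq_mhz, fraction=0.10):
    """True if skew is below `fraction` of the clock period (10% by default)."""
    period_ns = 1000.0 / clock_freq_mhz
    return clock_skew_ns(arrival_times_ns) < fraction * period_ns


# Hypothetical arrival times (ns) at four fanout nodes; not measured data.
arrivals = [1.20, 1.25, 1.31, 1.22]
skew = clock_skew_ns(arrivals)
print(f"skew = {skew:.2f} ns")
print("low skew at 200 MHz:", is_low_skew(arrivals, 200.0))
print("low skew at 1000 MHz:", is_low_skew(arrivals, 1000.0))
```

The same set of arrival times can satisfy the ten percent criterion at a low clock frequency and violate it at a higher one, which is why skew control becomes harder as clock frequency rises.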
Global clock signals are distributed at high frequencies by single source interconnects which operate as transmission lines. The global clock distribution system includes three major components: a clock driver, a distribution network and a clock receiver. The clock drivers can be either on-wafer or off-wafer drivers. The local clock lines on an MCM can be buffered to compensate for fanout power limitations when active components are built on the MCM in addition to passive components like multilayer metal interconnections. However, controlling multiple high-speed buffers at different MCM sites is difficult when the MCM is large and the number of buffers is high. Differing buffer delays among local buffers could result in intolerable clock skews at the ends of the clock lines.
The electrical H-tree clock distribution configuration is well known in the VLSI design community as a way of eliminating clock skews by making the interconnection distances equal for all nodes from the clock generator. But the most elaborate H-tree electrical clock distribution on MCMs is still limited by driver capabilities and transmission line properties such as reflections, crosstalk, line resistances, line capacitances and characteristic impedances. The electrical system clock distribution on MCM is limited to less than or equal to sixteen (16) nodes depending on the clock frequencies (50 to 200 MHz). Attaining high frequency synchronous clock distribution using electrical interconnections on MCM is difficult due to large fanout and long interconnection paths. When a high fanout interconnection length is larger than approximately 10 cm at a clock frequency of 500 MHz or greater, electrical interconnection latency becomes an issue in clock distribution.
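The equal-distance property of the H-tree can be illustrated with a short geometric sketch. It assumes an idealized tree that alternates horizontal and vertical splits and halves the arm length after each vertical segment; the function name, the 2 cm starting arm length and the coordinate system are illustrative assumptions, not values from the text.

```python
def h_tree_leaves(x, y, arm, levels, horizontal=True, path=0.0):
    """Return (x, y, path_length) for every leaf of an idealized H-tree.

    `arm` is the length of the wire segment from the current branch point
    to each of its two children. Splits alternate between horizontal and
    vertical, and the arm length halves after each vertical segment.
    """
    if levels == 0:
        return [(x, y, path)]
    dx, dy = (arm, 0.0) if horizontal else (0.0, arm)
    next_arm = arm if horizontal else arm / 2.0
    leaves = []
    for sign in (-1.0, 1.0):
        leaves += h_tree_leaves(x + sign * dx, y + sign * dy, next_arm,
                                levels - 1, not horizontal, path + arm)
    return leaves


# Four binary split levels give 16 leaf nodes; driver at the module center.
leaves = h_tree_leaves(0.0, 0.0, arm=2.0, levels=4)
path_lengths = {p for _, _, p in leaves}
print(len(leaves), "leaves, distinct path lengths:", path_lengths)
```

Every leaf is reached through the same total wire length, so in this idealized geometry the nominal skew is zero; the limits described above come from driver capability and transmission line behavior rather than from path length mismatch.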
In the electrical H-tree clock distribution system, when the signal propagation delay roughly matches the signal rise and fall time, the interconnection electrically isolates the driver from the receivers. The receivers no longer behave as direct loads to the driver; instead, the interconnection impedance becomes the driver load and the input impedance to the receiver. The signals can then be distorted by transmission line effects, producing reflections, overshoot, undershoot, ringing or crosstalk. In the transmission line design, the characteristic impedance, signal reflection at discontinuities, terminations and the ratio of line resistance to characteristic impedance should be carefully considered to ensure that signal transmission performance matches or exceeds the design requirements. The characteristic impedance of a transmission line is a key determinant of propagation delay, noise levels and power dissipation of the interconnection network.
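The condition in which propagation delay becomes comparable to the rise time can be checked with a rough estimate. This sketch assumes a homogeneous dielectric, so that signal velocity is c divided by the square root of the relative permittivity; the permittivity value, the threshold and the function names are illustrative assumptions.

```python
import math

C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def propagation_delay_ns(length_m, eps_r):
    """One-way delay of a line embedded in a uniform dielectric (ns)."""
    velocity = C_M_PER_S / math.sqrt(eps_r)
    return length_m / velocity * 1e9


def needs_transmission_line_model(length_m, eps_r, rise_time_ns, threshold=1.0):
    """True when delay is at least `threshold` times the rise time.

    threshold=1.0 mirrors "roughly matches"; stricter design rules that use
    a smaller fraction of the rise time are a common alternative.
    """
    return propagation_delay_ns(length_m, eps_r) >= threshold * rise_time_ns


# Hypothetical case: 10 cm line, eps_r = 3.5 (assumed), 0.5 ns rise time.
delay = propagation_delay_ns(0.10, 3.5)
print(f"delay = {delay:.3f} ns")
print("transmission line effects expected:",
      needs_transmission_line_model(0.10, 3.5, 0.5))
```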
The electrical H-tree network design on MCMs additionally requires careful consideration of signal driver capabilities, MCM characteristics (substrate size, substrate material, dielectric material, type of metal), number of components on the MCM, operation frequencies of interest and so on. The control of reflected signals along a transmission line influences the signal transmission quality. Thus impedance matching is required at every possible branching point. Source-end termination is a popular clock line termination method since it allows no dynamic power dissipation. The interconnection line resistance influences the clock rise time at the receiving end, and the line resistance can be suppressed to a small value by scaling the clock line dimensions. For very high frequency signal transmission, the skin effect in the metal conductors could increase the effective line resistance and should also be included when calculating the transmission line resistance. Dielectric loss can also increase the transmission line resistance. All three of these contributions to transmission line resistance should be considered and suppressed to a small value compared to the transmission line characteristic impedance so that the interconnection network can behave as lossless transmission lines.
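To give a feel for when the skin effect matters, the standard skin depth expression, the square root of resistivity divided by (pi times frequency times permeability), can be evaluated for a candidate conductor. The copper resistivity used as the default and the 5 micrometer line thickness are assumptions for illustration; the text does not specify the metal or its dimensions.

```python
import math

MU_0 = 4e-7 * math.pi  # permeability of free space (H/m)


def skin_depth_m(freq_hz, resistivity_ohm_m=1.68e-8, mu_r=1.0):
    """Skin depth in meters; default resistivity is that of copper (assumed)."""
    return math.sqrt(resistivity_ohm_m / (math.pi * freq_hz * MU_0 * mu_r))


def thickness_in_skin_depths(thickness_m, freq_hz, **material):
    """Conductor thickness expressed in skin depths at the given frequency."""
    return thickness_m / skin_depth_m(freq_hz, **material)


# Hypothetical 5 um thick line evaluated at 1 GHz.
depth = skin_depth_m(1e9)
print(f"skin depth at 1 GHz: {depth * 1e6:.2f} um")
print(f"5 um line = {thickness_in_skin_depths(5e-6, 1e9):.1f} skin depths")
```

When the conductor is several skin depths thick, current crowds toward the surface and the effective resistance rises above its DC value, which is the contribution the paragraph asks the designer to include.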
General guidelines for designing an electrical H-tree network are: (1) ascertain the network specification such as clock frequency, risetime, falltime, maximum allowable skew, module size, number of components on MCM and type of MCM and interconnection materials; (2) identify the clocked component sizes (i.e., IC sizes on the MCM) which require a maximum clock skew less than the design specification; (3) design a candidate H-tree network by varying wiring parameters and number of unmatched branches so that the network can accommodate the design specifications such as driver capability, bandwidth specification, number of nodes, maximum allowable wire width and thickness and so on; and (4) simulate the performance of a candidate network. If it satisfies the design requirements, use this design. If not, perform steps (3) and (4) again, while using a simulator to investigate the transient response of the H-tree network.
A performance evaluation of an example electrical H-tree network uses a lossless transmission line analysis with a SPICE simulation. The line resistances of the H-tree network are assumed negligible compared to the characteristic impedance, so that the network can be modeled as lossless transmission lines. Assume that the load impedance at each line termination is 50 Ω, matched to the receiving buffer load impedance. Impedance matching at the branches is established by choosing appropriate wire thicknesses, widths and dielectric thicknesses. Unfortunately, if all the branches in the 16 node H-tree network are matched, the characteristic impedance at the clock driver end is 3.13 Ω, which is too low to drive even for the more capable, high-power clock drivers implemented in ECL or advanced CMOS.
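The figure of roughly 3.13 ohms follows from simple arithmetic if each branch is a binary split: a matched parent line must equal two child lines in parallel, so the impedance halves at every level moving from the loads toward the driver. The sketch below reproduces that calculation; the function name is hypothetical, and the binary split structure is an assumption consistent with an H-tree.

```python
def matched_h_tree_impedances(load_ohms, nodes):
    """Impedance at each level of a fully matched binary tree.

    Index 0 is the leaf line (matched to the load); the last entry is the
    impedance seen by the driver at the root branch point.
    """
    levels = nodes.bit_length() - 1
    if nodes < 1 or (1 << levels) != nodes:
        raise ValueError("nodes must be a power of two")
    impedances = [load_ohms]
    for _ in range(levels):
        # Two equal child lines in parallel present half their impedance.
        impedances.append(impedances[-1] / 2.0)
    return impedances


levels = matched_h_tree_impedances(50.0, 16)
print("per-level impedances (ohms):", levels)
print("driver sees:", levels[-1], "ohms")
```

For 16 nodes and 50 ohm loads the driver-side value is 50/16 = 3.125 ohms, which rounds to the 3.13 ohms quoted above.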
To relax the driver requirement, the network should be designed to have one or more levels of unmatched branches. Thus, a more complex transmission line design problem must be solved, in which the effects of multiple reflections at the unmatched branch locations must be taken into account. Determine the transmission line wire thickness and width to optimally satisfy the wire dimension requirements for a single level of unmatched locations. It is generally desirable to have a small wire dimension for high interconnection density. The ratio of metal-to-dielectric thickness influences the wiring density, and the minimum wire dimension is achieved by choosing it between 1.5 and 2.0. Assume a dielectric-to-metal thickness of 1.5 is chosen.
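The reflection introduced by an unmatched branch can be estimated with the standard reflection coefficient, (Z_load - Z0) / (Z_load + Z0), where the load seen by the parent line is the parallel combination of its child lines. This is a minimal sketch under the assumption of lossless lines and equal child impedances; the function name and the 50 ohm values are illustrative.

```python
def branch_reflection_coefficient(parent_z0, child_z0, n_children=2):
    """Voltage reflection coefficient seen by a wave arriving at a branch.

    The parent line of impedance `parent_z0` is loaded by `n_children`
    identical child lines in parallel (lossless lines assumed).
    """
    z_load = child_z0 / n_children
    return (z_load - parent_z0) / (z_load + parent_z0)


# Matched branch: a 25 ohm parent feeding two 50 ohm children.
print("matched:", branch_reflection_coefficient(25.0, 50.0))

# Unmatched branch: parent and children all at 50 ohms (hypothetical case).
gamma = branch_reflection_coefficient(50.0, 50.0)
print(f"unmatched: {gamma:.3f}")
```

In the unmatched case one third of the incident voltage is reflected with inverted polarity, and those reflections bounce between branch points, which is why the unmatched design requires the transient simulation described in the text.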
In the SPICE simulation, as the number of levels of unmatched branches increases, beginning at the final termination nodes, the simulated signal rise time degrades and the network bandwidth decreases (although the driver power requirements significantly decrease). As the MCM size grows from 4 cm to 10 cm for the unmatched configuration (in only the final terminal branches), the signal rise time degrades from 0.63 ns to 1.32 ns, which reduces the network bandwidth. For a 16 node network on an 8 × 8 cm² MCM, the maximum clock frequency supported is about 300 MHz with unmatched branches at only the final terminal branches.
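The link between rise time and bandwidth can be approximated with the common rule of thumb that the 3 dB bandwidth is about 0.35 divided by the 10% to 90% rise time. This rule assumes a single-pole response and is not stated in the text; it is used here only to show the trend, and it does not reproduce how the 300 MHz figure was obtained.

```python
def bandwidth_mhz_from_rise_time(rise_time_ns, k=0.35):
    """Approximate 3 dB bandwidth (MHz) from rise time (ns).

    k = 0.35 is the single-pole rule of thumb (an assumption here).
    """
    return k / rise_time_ns * 1000.0


# Rise times quoted in the text for 4 cm and 10 cm modules.
for size_cm, t_rise_ns in ((4, 0.63), (10, 1.32)):
    bw = bandwidth_mhz_from_rise_time(t_rise_ns)
    print(f"{size_cm} cm MCM: rise time {t_rise_ns} ns -> about {bw:.0f} MHz")
```

Under this approximation the estimated bandwidth falls by roughly half as the module grows from 4 cm to 10 cm, consistent with the qualitative statement that larger modules support lower clock frequencies.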
The electrical H-tree clock distribution network on MCMs has a hard limit on the fanout due to driver capability, network bandwidth requirement and limited signal layers on the silicon MCM. The best ECL and advanced CMOS drivers have driver resistances in the range of 5 to 7 Ω and 10 to 20 Ω, respectively. Driving a fully impedance matched 16 node H-tree on an 8 × 8 cm² MCM requires a driver resistance of approximately 3.13 Ω, which is impossible to provide given today's technology. Further, the wire width at the clock driver must be about 400 µm, which may violate wiring density requirements. Thus, for frequencies above 300 MHz, clock distribution on MCM is very difficult to achieve even for 16 nodes, let alone for a larger number.
Generally computer systems can be classified as synchronous systems or asynchronous systems according to their clocking schemes. In synchronous systems every component of the computer system is in lock-step controlled by a global system clock. Thus, there is only one global system clock generated by a single oscillator. Synchronous clocking has been the choice for computer systems with small size and single or a small number of CPUs. A synchronous system can be implemented by numerous schemes such as single phase clocking with latches, edge-triggered flip-flops, two phase clocking with single or double latches or multiphase clocking at the component input boundaries.
Asynchronous systems have multiple independent local clock sources for the system. Communications between independent subsystems are maintained by asynchronous interfaces or self-timed logic signals. An asynchronous system can be configured as locally synchronous islands communicating through asynchronous connections. In self-timed systems, self-timed signals (acknowledge and request) replace the role of clock signals. As a result, self-timed systems do not even need clock signals. A self-timed system can be configured as self-timed subsystems communicating through asynchronous protocols or self-timed signals. In either case, no external clocks are necessary.
In general, synchronous systems have better performance compared to asynchronous systems. Synchronous systems are much easier to design and test, and they use less hardware. Debugging the asynchronous interfaces between independent subsystems is usually extremely difficult, particularly for large systems. The reduction of design complexity of the computer system with better system throughput makes synchronous clocking an ideal choice. However, controlling the global system clock signals with a small skew bound at high clocking frequencies throughout a large computer system has so far prevented practical implementation of large synchronous systems. For this reason, large distributed computer systems are currently configured as asynchronous systems at the cost of reduced performance and increased design complexity.
For distributed multiprocessor computer systems there is generally no global system clock, mostly because of the problems related in the above paragraphs. Rather, system synchronization is maintained by using a variety of message passing schemes, which introduces considerable complexity in programming and efficiently using such machines. Synchronous global clock distribution, however, is highly desirable for distributed computer systems to simplify their architecture and enable higher speed performance.
An electrical interconnection with a large fanout on a large MCM required to operate at very high frequencies has difficult problems due to limited network bandwidth and driver capability. Optical interconnection is beneficial for a large fanout signal (like a synchronous global clock signal) on an MCM since it has large bandwidth, small power consumption and large fanout capability. An optical H-tree is needed when a high speed synchronous global clock signal should be distributed on a large MCM with a large fanout (16 or more). In addition, an optical H-tree is more suitable for flexible placement of ICs on an MCM by supporting unbalanced H-trees with small design modification and essentially no performance degradation.