New computer architectures, especially those utilizing multiprocessor technologies, use high-speed serial point-to-point link interconnect strategies. These links typically have high frequency (2 GHz plus) signaling that, due to losses in the transmission media (printed circuit board and connectors), results in small signals being received at the input of the destination chip. These small amplitude signals are normally sent differentially to increase the link reliability and reduce the link's sensitivity to common-mode noise. A link module typically consists of one to twenty bits of data operating in bidirectional mode. A link module also typically includes timing recovery circuitry (multi-phase lock loop) and voltage restore circuitry (integrating receivers). This architecture has many advantages.
Among these advantages, the architecture provides a simpler chip floor plan for large, multi-core, multi-link central processing units. These advantages further include centralized link timing recovery and simplified link to cross bar synchronization. Still further advantages of this architecture include fewer phase lock loops required for clock generation, and a shared cross bar switch, link, and core clock generation. The particulars of this architecture, however, are dictated at least in part by the high-speed link requirements.
The critical signaling requirements of these high-speed links require sensitive analog receiver and transmitter circuitry plus sensitive timing recovery blocks. The placement and grouping of these analog blocks are constrained by several issues, wherein the issues include:
    common analog power supplies;
    common external reference components;
    C4 bump density, package routing density, and pin density issues;
    clock and data skew;
    placement of the link receivers and transmitters relative to other on-chip input/output modules such as, for example, the cross bar switch;
    on-chip routes and route propagation delays; and
    placement within a sea of on-chip cache random access memory.
These issues are aggravated by characteristics of high performance link-based central processing units.
High performance multi-core central processing units possess characteristics that aggravate issues constraining placement and grouping of the analog blocks. For example, high performance multi-core central processing units with large amounts of on-chip cache random access memory result in large die sizes. These high performance central processing units also have many high-speed links, and an on-chip cross bar switch so as to enable the construction of larger multi-processor computer systems without the need for many external core electronic components. Large die size central processing units with six or more links aggravate the layout of the high-speed links.
FIG. 1 illustrates a typical layout scheme for a dual core central processing unit 10 with six high-speed link modules 12, a cross bar switch 14, and a large amount of on-chip cache random access memory 16. In accordance with the existing link layout architecture, each of the individual link modules 12 is placed on the edge of the die to satisfy the C4 bump and package route constraints and to co-exist with the cache random access memory 16. Due to the large distances involved when the link modules are arranged in such a long and narrow rectangular geometry, each link module contains its own phase lock loop (IO-PLL) 18. This IO-PLL 18 generates the local clock phases required to retime the incoming small analog signals, and must be placed in close proximity to the link receivers to minimize timing jitter and routing skew. The IO-PLL 18 clock phases are shared among all the individual bits of each link.
Due to the large number of links per central processing unit, links have to be placed on two or more sides of the die. Also, the cross bar switch 14 needs to be close to all of the link modules 12 to reduce timing and synchronization issues. Therefore, the cross bar switch 14 resides between the two central processing unit cores 22 in a central area delineated by placement of the central processing unit cores 22. The cross bar switch 14 also lies next to the system phase lock loop (SYS-PLL) 20. The SYS-PLL 20 generates timing for the two cores 22 and the cross bar switch 14.
This link 12 and core 22 layout architecture has a large number of separate phase lock loop circuits. Altogether there are seven phase lock loop circuits (the six IO-PLLs 18, one per link module 12, plus the SYS-PLL 20), and each of these phase lock loops is driven from an external global clock. This architecture suffers from timing delay and synchronization problems across the die, from the cross bar switch 14 to the link modules 12 at the edge of the die.
The link module 12 recovers and retimes the data locally, close to the C4 bumps of the receivers. However, due to the long, thin rectangular shape of each link module 12, extra timing skew is incurred when the individual bits are combined to form a bus of the link received data.
An electrical block diagram 24 of a conventional link module is shown in FIG. 2. The link module contains both the receivers and the output drivers. Each bit's output driver 26 consists of a source-terminated differential driver with an output level pre-emphasis stage 28. The output pre-emphasis stage 28 uses a phase of the IO-PLL 30 to delay the output signal and to determine whether level emphasis or level reduction is needed.
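The emphasis decision described above can be illustrated with a minimal sketch. It assumes a two-tap arrangement in which the delayed copy of the signal is the previous bit; the function name pre_emphasize, the tap weights main and post, and the choice of starting level are hypothetical and are not taken from FIG. 2. Under these assumptions, a bit that differs from its predecessor is driven at an emphasized level, and a repeated bit is driven at a reduced level.

```python
def pre_emphasize(bits, main=1.0, post=0.25):
    """Return per-bit drive levels for a two-tap pre-emphasis model.

    bits : sequence of 0/1 values to transmit.
    main : weight of the current bit (hypothetical value).
    post : weight of the delayed (previous) bit (hypothetical value).
    """
    # Map logic values to differential symbols: 1 -> +1.0, 0 -> -1.0.
    symbols = [1.0 if b else -1.0 for b in bits]
    levels = []
    # Assumption: before the first bit, the line already sits at that bit's level.
    prev = symbols[0] if symbols else 0.0
    for s in symbols:
        # Subtracting the delayed symbol boosts transitions (s != prev)
        # and reduces the level of repeated bits (s == prev).
        levels.append(main * s - post * prev)
        prev = s
    return levels
```

With these illustrative weights, a transition is driven at a magnitude of main + post and a repeated bit at main - post, which reflects the "emphasis or reduction" choice in a simplified form.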
Each bit in the link module utilizes either a strobed sampler or an integrating receiver amplifier 32. Timing recovery in the link module is also performed on a per-bit basis. The IO-PLL 30 generates a four-phase output clock at the incoming bit rate. Each bit contains an interpolator block 34, which is trained during a power-up period to estimate a best-case strobe, and the strobe is used to latch the output of the integrating receiver amplifier 32 with latch component 36. The deskew circuitry 38 aligns the individual bits at the outputs of the receivers to generate the final bus result.
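One plausible way to picture the power-up training is sketched below. It assumes that the interpolator can be stepped through a fixed number of strobe positions spanning one clock period, and that each position has already been marked as passing or failing against a known training pattern; the function name pick_strobe and the pass/fail representation are hypothetical and do not describe the actual circuit. The sketch selects the centre of the widest error-free window, treating the positions as circular because they wrap at the end of the period.

```python
def pick_strobe(passes):
    """Return the strobe position at the centre of the widest passing window.

    passes[i] is True when strobe position i sampled the training pattern
    without error. Positions are circular: the last one is adjacent to the first.
    """
    n = len(passes)
    if not any(passes):
        raise ValueError("no error-free strobe position found")

    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    # Walk the circle twice so a window that wraps past the end is seen whole.
    for i in range(2 * n):
        if passes[i % n]:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0

    # If every position passes, the doubled walk overcounts; cap at n.
    best_len = min(best_len, n)
    return (best_start + best_len // 2) % n
```

Choosing the centre of the window, rather than the first passing position, leaves roughly equal margin on both sides of the strobe; this is a design assumption of the sketch rather than a stated property of interpolator block 34.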
The link layer control block 40 performs other link control and buffering functions, and interfaces between the link electrical layer, the link protocol layers, and the cross bar switch 42.
Communication route 44 and communication route 46 in this methodology are long and therefore accumulate timing delay and skew. These routes run between the cross bar switch 42 and the link layer control block 40, where data is transferred synchronously with the system clock. Managing clock skew across routes 44 and 46 is a difficult problem.
There remains, therefore, a need for a solution to the aforementioned problems. The present invention provides such a solution.