Not applicable.
Not applicable.
1. Field of the Invention
The present invention generally relates to a computer system comprising a plurality of pipelined, superscalar microprocessors. More particularly, the invention relates to communication of data between a core logic chipset and multiple processors. More particularly still, the invention relates to the recovery of data transmitted along different point-to-point data paths between components in the chipset and the processors.
2. Background of the Invention
It often is desirable to include multiple processors in a single computer system. This is especially true for computationally intensive applications and applications that otherwise can benefit from having more than one processor simultaneously performing various tasks. It is not uncommon for a multi-processor system to have 2 or 4 or more processors working in concert with one another. Typically, each processor couples to at least one and perhaps three or four other processors.
Such systems usually require data and commands (e.g., read requests, write requests, etc.) to be transmitted from one processor to another. As processor and bandwidth capabilities increase, the size of the data and command packets also increases. In transmitting this information between processors, it may be desirable to deliver these data packets in contiguous form. That is, the data is preferably transmitted along parallel data traces between respective processors. To accomplish this, signal paths between the processors must exist for each bit of information in a packet. A 32-bit long packet therefore would require 32 separate signal paths between processors.
Many modern multi-processor systems rely on a core logic chipset to direct data traffic between processors and the outside world. A conventional core logic chipset includes, among other things, a memory controller and I/O interface circuitry. Older chipsets would also control cache memory, but newer designs are delegating this role to the processors to which the cache memories are connected. To improve bandwidth and reduce latency, chipsets are being designed with point-to-point, switched data transfer architectures rather than shared bus architectures. The switched architecture allows direct connection between two devices, aids performance by allowing for higher clock rates, and permits scalable bandwidth.
To take advantage of the direct, point-to-point connections between devices in the chipset, a clock forwarding technique is commonly used. In this technique (sometimes referred to as a source synchronous technique), timing signals are sent in parallel with data signals. This contrasts with the method in which the destination device samples the incoming data using a clock that is internal to the destination device and asynchronous to the incoming data (i.e., rising and falling edges do not align with respect to time). In the clock-forwarding scheme, the clock and data are fully synchronized, which permits more efficient data extraction by the destination device.
Clock forwarding transmission schemes work by sampling the incoming data at the receiving device using the corresponding forwarded clock signal. The receiving device commonly employs a latch or series of latches (flip-flops) to sample the data. The latches are triggered using the forwarded clock such that the data is pulled into the receiving device at the appropriate rising or falling edge of the forwarded clock signal.
The data sampling latches used in this transmission scheme require that the data be present at the input to the latch for a minimum amount of time before and after the latch is triggered by a forwarded clock edge. These are referred to as the setup and hold time requirements for a latch. The setup and hold requirements, if met, guarantee that the data is sampled reliably. If the setup or hold time is violated, the sampled signal becomes unstable and unreliable. In actual implementations of the clock-forwarding scheme, the forwarded clock must be delayed slightly to guarantee that the data arrives at the sampling latches before the corresponding clock edge arrives. This timing adjustment is referred to as clock tuning. Clock tuning is typically implemented by adding etch to the clock signal trace. If enough tuning etch is added, the setup and hold requirements of the sampling latches can be met and the data can be reliably extracted by the receiving device.
The process of tuning a forwarded clock is iterative and can be cumbersome. Theoretical values for the required length of tuning etch are determined beforehand based on the length of the data paths. Computer-aided design (CAD) designers can lay in additional etch to the forwarded clock traces, but tests must be run on actual hardware to determine if more or less tuning etch is needed for a given data group. The designs are then altered in the CAD database and the process is repeated. The tuning process is therefore time consuming, tedious, and error-prone.
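The setup and hold relationship described above can be illustrated with a minimal timing sketch. The function names, the picosecond units, and the idea of a single "data valid window" at the latch input are assumptions made for illustration only; they are not taken from any particular implementation.

```python
def meets_setup_hold(data_valid_start_ps, data_valid_end_ps,
                     clock_edge_ps, setup_ps, hold_ps):
    """Return True if a clock edge satisfies both latch requirements.

    Setup: data must be valid at least setup_ps before the clock edge.
    Hold:  data must stay valid at least hold_ps after the clock edge.
    """
    setup_ok = clock_edge_ps - data_valid_start_ps >= setup_ps
    hold_ok = data_valid_end_ps - clock_edge_ps >= hold_ps
    return setup_ok and hold_ok


def tuning_delay_window(data_valid_start_ps, data_valid_end_ps,
                        untuned_clock_edge_ps, setup_ps, hold_ps):
    """Return (min_delay, max_delay) in ps of added clock delay that meets
    both requirements, or None if no non-negative delay can meet them."""
    min_delay = max(0, data_valid_start_ps + setup_ps - untuned_clock_edge_ps)
    max_delay = data_valid_end_ps - hold_ps - untuned_clock_edge_ps
    if max_delay < min_delay:
        return None
    return (min_delay, max_delay)


# Hypothetical numbers: data valid from 1000 ps to 3000 ps, clock edge
# arriving together with the data at 1000 ps, 300 ps setup, 200 ps hold.
window = tuning_delay_window(1000, 3000, 1000, 300, 200)
```

With these hypothetical numbers, the untuned clock edge violates setup because it arrives at the same instant as the data, and any added delay inside the returned window restores a valid sampling point. In a board design that delay would correspond to a length of tuning etch.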
Modern core logic chipsets include a number of devices, each capable of transmitting data to and from a processor. For example, the Compaq 21264 Alpha processor has employed a core logic chipset that includes ASIC chips capable of transmitting 64-bit data bundles to four separate processors. Transmitting 64 bits of data in parallel can monopolize a large amount of real estate on a system board or motherboard. In many cases, the data bundles are separated into sub-bundles to allow for more efficient use of board space. In such cases, each sub-bundle is transmitted with its own forwarded clock to guarantee reliable data transmission.
One drawback to separating the data bundles into sub-bundles is that each forwarded clock must be tuned individually. Since the routing path for each sub-bundle will invariably be unique, the amount of tuning etch needed for each forwarded clock will be different. In a multi-processor system, this problem quickly grows into an enormous task. If a 64-bit data bundle is separated into 8 sub-bundles and the system has four processors, there are 32 separate forwarded clock traces that must be individually tuned. This number is effectively doubled when the tuning required for the forwarded clocks associated with data transmission in the opposite direction is considered. This tuning consumes not only a large amount of board real estate, but also time and money.
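The count of individually tuned clock traces in this example follows from simple multiplication, sketched below. The function name and its parameters are hypothetical and serve only to restate the arithmetic in the paragraph above.

```python
def forwarded_clocks_to_tune(sub_bundles, processors, both_directions=False):
    """Count forwarded clock traces needing individual tuning when every
    sub-bundle on every processor link carries its own forwarded clock."""
    one_way = sub_bundles * processors
    return 2 * one_way if both_directions else one_way


# 64-bit bundle split into 8 sub-bundles, four processors.
one_direction = forwarded_clocks_to_tune(8, 4)
two_directions = forwarded_clocks_to_tune(8, 4, both_directions=True)
```

For the figures given in the text, this yields 32 traces in one direction and 64 when the return direction is included.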
Another consideration in a clock-forwarding transmission scheme relates to skew problems. Since board area is needed to allow for etch tuning of the forwarded clocks, the most direct data path from source to destination is not always used. This results in skew between the data sub-bundles. That is, the data sub-bundles arrive at their destination at different times. This creates latency delays due to the additional time required for the receiving device to reconstruct the original data bundle from the sub-bundles.
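The skew between sub-bundles can be estimated from their routed trace lengths, as in the following sketch. The per-inch propagation delay of 170 ps is an illustrative assumption, not a value from this document, and the function names are hypothetical.

```python
def arrival_times_ps(trace_lengths_in, delay_ps_per_in=170, launch_ps=0):
    """Arrival time of each sub-bundle, assuming all are launched together
    and each travels at the same (assumed) delay per inch of trace."""
    return [launch_ps + length * delay_ps_per_in for length in trace_lengths_in]


def bundle_skew_ps(trace_lengths_in, delay_ps_per_in=170):
    """Skew is the gap between the earliest and latest sub-bundle arrival.
    The receiver cannot rebuild the full bundle until the latest arrives."""
    arrivals = arrival_times_ps(trace_lengths_in, delay_ps_per_in)
    return max(arrivals) - min(arrivals)


# Hypothetical routed lengths, in inches, for four sub-bundles.
skew = bundle_skew_ps([4, 5, 6, 4])
```

In this model the receiver's added latency equals the skew, since reconstruction of the original bundle must wait for the slowest sub-bundle.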
It is desirable, therefore, to develop a data transmission scheme that reduces the quantity of tuning etch required to reliably sample data at a receiving device. The transmission scheme preferably offers reliable data transfer between devices while minimizing latency and skew and maximizing bandwidth. The transmission scheme may also indirectly improve the manufacturability of printed wiring boards and processor hardware by easing the requirements for parallel, equal-length data paths. Design times may also be advantageously reduced by eliminating much of the iterative process required in tuning forwarded clock paths.
The problems noted above are solved in large part by a clock forwarding scheme for use in a system comprising a plurality of communications links, each link configured to transmit data packets from a transmitting device to a receiving device. Each communications link includes a conduction path for each data bit in the data packet and at least one conduction path for a forwarded clock signal that is synchronously transmitted with the data packet. Tuning etch is eliminated from each individual forwarded clock path. Instead, the required setup and hold delay in the forwarded clock signal is generated at the transmitting device by adding tuning etch to the signal path for all forwarded clock signals prior to transmission of the forwarded clock signal and data bits. In other words, a single tuning etch is needed instead of one tuning etch for every communications link. The forwarded clock signal and data are advantageously transmitted via conduction paths in each communications link that are substantially parallel and of equal length.
Preferably, the plurality of communications links are of equal or similar length to eliminate or reduce skew.
The setup and hold delay is added upstream of the conventional location. The source device preferably has at least two clock output pins to deliver two synchronous clock signals off the device and at least two clock input pins to receive the clock signals. Termination circuits are coupled to these clock signals for adjusting duty cycles of the clock signals and improving symmetry of the forwarded clock and data signals downstream at the destination devices. One of the two clock signals is delayed with respect to the other via a longer tuning etch path between the output pins and input pins on the device. The delayed clock signal is used to trigger logic to transmit a forwarded clock signal to the plurality of communications links. The undelayed clock signal is used to trigger logic to transmit data bits to the plurality of communications links. These clock signals are used to trigger the output logic for each output port in the source device. In the preferred embodiment, a single tuning etch advantageously replaces four individual tuning etches that are typically associated with conventional source synchronous clock forwarding schemes.
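The effect of applying one tuning delay at the source can be sketched with a simple timing model. The model assumes, as the summary states is preferred, that the forwarded clock and data within each link travel equal-length paths; the `Link` class, the flight times, and the 400 ps delay are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    flight_ps: int  # flight time shared by clock and data on this link


def receiver_offsets_ps(links, source_tuning_delay_ps, launch_ps=0):
    """For each link, return how far the forwarded clock edge trails the
    data at the receiver when the delay is applied once, at the source."""
    offsets = {}
    for link in links:
        # Data is launched by the undelayed clock.
        data_arrival = launch_ps + link.flight_ps
        # The forwarded clock is launched by the delayed clock.
        clock_arrival = launch_ps + source_tuning_delay_ps + link.flight_ps
        offsets[link.name] = clock_arrival - data_arrival
    return offsets


# Four hypothetical output ports with different link flight times.
ports = [Link("port0", 700), Link("port1", 900),
         Link("port2", 1200), Link("port3", 850)]
offsets = receiver_offsets_ps(ports, source_tuning_delay_ps=400)
```

Under these assumptions every receiver sees the same clock-to-data offset regardless of link length, because the flight time cancels within each link. This is why, in the model, one source-side delay can stand in for a separately tuned etch on each of the four links.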