Not applicable.
1. Field of the Invention
The present invention generally relates to a computer system comprising a plurality of pipelined, superscalar microprocessors. More particularly, the invention relates to communication of data between multiple processors. More particularly still, the invention relates to the recovery of data transmitted in an asynchronous clock domain along different point-to-point data paths between processors.
2. Background of the Invention
It often is desirable to include multiple processors in a single computer system. This is especially true for computationally intensive applications and applications that otherwise can benefit from having more than one processor simultaneously performing various tasks. It is not uncommon for a multi-processor system to have 2 or 4 or more processors working in concert with one another. Typically, each processor couples to at least one and perhaps three or four other processors.
Such systems usually require data and commands (e.g., read requests, write requests, etc.) to be transmitted from one processor to another. As processor and bandwidth capabilities increase, the size of the data and command packets also increases. In transmitting this information between processors, it may be desirable to deliver these data packets in contiguous form. That is, the data is preferably transmitted in parallel between respective processors. To accomplish this, a signal path between the processors must exist for each bit of information in a packet. A 32-bit packet therefore would require 32 separate signal paths between processors.
Routing of multiple, parallel signal paths is difficult in congested printed wiring board configurations. As more components are added to circuit boards, little room is left for signal traces, especially multiple traces that are preferably parallel and of equal length. These routing difficulties exist even in multi-layer board designs. It may be difficult to guarantee that individual bits in a data packet sent at the same time from one processor will arrive at their destination at the same time because signal traces are rarely equal in length. In an extreme case, it may be desirable to intentionally divide the signal paths for a single packet into multiple branches since routing of smaller sub-branches may be easier than routing all the signal paths together. For instance, the 32-bit packet discussed above may be split into two 16-bit packets. Splitting data in this manner makes trace routing less troublesome, but raises issues of signal integrity because the separated signals must be recombined at the destination to form the original data packets. One way to help ensure the data is captured correctly is to send a clock signal with each branch of the data packet. The clock signals may be used to locate data transitions and to account for differences in path lengths between the branches of the data packet. The clock signal may be sampled at the receiver to locate clock edges and correctly extract the data. A mechanism must be created to receive data from these two 16-bit branches and recombine the data into its original, 32-bit form.
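By way of illustration only, the splitting and recombining of the 32-bit packet discussed above may be sketched in software as follows. The function names and the choice of which 16 bits travel on which branch are assumptions made for this example; the sketch ignores skew and clocking entirely and shows only the bit-level bookkeeping.

```python
def split_packet(packet: int) -> tuple[int, int]:
    """Split a 32-bit packet into two 16-bit branches (low half, high half).

    The low/high assignment is an assumption for illustration only.
    """
    if not 0 <= packet < (1 << 32):
        raise ValueError("packet must fit in 32 bits")
    low = packet & 0xFFFF            # bits 0..15 travel on the first branch
    high = (packet >> 16) & 0xFFFF   # bits 16..31 travel on the second branch
    return low, high


def recombine_packet(low: int, high: int) -> int:
    """Rebuild the original 32-bit packet from its two 16-bit branches."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)
```

Recombining is trivial only when both halves belong to the same packet, which is precisely what differing path lengths put at risk.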
The above problem is exacerbated if the data is transmitted at a clock frequency that is different from the processor's internal clock frequency. The receiver must not only recombine the data that has been split among different transmission paths, it must also read and hold the data until the processor is ready to pull the data into the internal clock domain. A number of problems may arise in accomplishing these steps. First, there is no guarantee that data that was aligned as it left the transmitting processor is still aligned when it arrives at the receiving processor. Second, if a clock signal is sent with each transmission path, there is no guarantee that the receiving processor will obtain the same result from sampling the separate clocks. For instance, in the example given above where the 32-bit packet is divided into two separate paths, the clock signals from each 16-bit group may be sampled at exactly the same time, but because of skew, different results may be obtained. Even if one could guarantee that the data in the two separate branches of the packet arrive at exactly the same time, the clock signal for one branch may be sampled before a clock edge while the other may be sampled after a clock edge. The end result may be incorrectly combined data. Third, because of the asynchronous nature of the transmitted signals, it is highly likely that, in waiting to pull captured data into the processor's clock domain, the captured data may be overwritten by incoming data. While buffers may be used to solve these timing and skew problems, unwanted latency may be introduced.
It is desirable, therefore, to develop a data capture scheme that successfully reconstructs and re-synchronizes data at a receiving processor. The capture scheme preferably offers reliable data transfer between processors while minimizing latency and maximizing bandwidth. The capture scheme may also indirectly improve the manufacturability of printed wiring boards and processor hardware by easing the requirements for parallel, equal-length data paths.
The problems noted above are solved in large part by an input data recovery scheme that may be implemented in a multiprocessor system comprising a communications link configured to transmit data packets from a transmitting processor to a receiving processor. The communications link includes a conduction path for each data bit in the data packet. The conduction paths are grouped into separate bundles and routed along different paths and a forwarded clock signal is sent with each bundle. The forwarded clock signal is transmitted on a differential pair of conduction paths. At the receiving processor, the data in the separate bundles is recombined to recreate the original data packet. The processors operate with a clock frequency that is at least three times as fast as the clock frequency of the forwarded clock signal and data is transmitted on both rising and falling edges of the forwarded clock signal.
The receiving processor contains a recovery circuit which samples the forwarded clock signals to locate corresponding clock edges in the separate forwarded clock signals to indicate when the data on the conduction paths may be pulled into the processor clock domain. The recovery circuit includes a delay locked loop ("DLL") circuit, a sampling circuit, a finite state machine, and data capture logic. A DLL circuit is coupled to each forwarded clock signal to create a delayed copy of the forwarded clock signal. The clock signal is delayed so that the clock edges in the delayed clock signal are aligned with the center of the data window for data transmitted with the forwarded clock signal.
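As a rough numerical illustration of centering a clock edge in the data window, the following sketch computes a nominal delay. It assumes that data is transmitted on both edges of the forwarded clock, so that each data window lasts half a clock period, and that data transitions coincide with the undelayed clock edges; the function name and the example frequency are hypothetical and are not taken from this description.

```python
def nominal_dll_delay_ns(forwarded_clock_mhz: float) -> float:
    """Delay that places a clock edge at the center of the data window.

    Assumes one data item per clock edge (window = half a period) and
    data transitions aligned with the undelayed clock edges.
    """
    if forwarded_clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    period_ns = 1000.0 / forwarded_clock_mhz
    data_window_ns = period_ns / 2.0   # one data item per clock edge
    return data_window_ns / 2.0        # center of that window


# Hypothetical 250 MHz forwarded clock: 4 ns period, 2 ns window, 1 ns delay.
example_delay_ns = nominal_dll_delay_ns(250.0)
```

An actual DLL locks to the incoming clock rather than computing a fixed number; the arithmetic above only shows where the delayed edge is intended to fall.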
The recovery circuit also includes a sampling circuit configured to sample the delayed clock signal at the processor clock frequency to locate rising and falling edges in the delayed clock signal. The sampling circuit comprises a chain of flip-flops configured to sample the delayed clock signal and generate a string of sequential samples of the clock signal. The sampling circuit also includes a bank of logic gates configured to set a bit at the output of one of the logic gates indicating that an edge transition occurs between any two of the three sequential samples. Shift registers are coupled to each logic gate and are configured to shift the output of the associated logic gate at every processor clock cycle. A multiplexer is coupled to each shift register and is configured to extract data from a bit location in the shift register as specified by a clock ratio input. This clock ratio is based on the ratio of the transmission and processor clock frequencies and also on the length of the flip-flop chain through which the clock signals are sampled. This information is used to take advantage of the periodic nature of the forwarded clock signal and allows the current data packet to be extracted using a clock edge in the past. This eliminates the need for buffering and unwanted latency that may occur in allowing for worst case skew conditions between the separate data bundles.
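The behavior of the sampling circuit may be approximated in software as shown below. This is a simplified sketch under stated assumptions, not the circuit itself: it compares only the two newest samples in the chain, keeps one shift register for rising edges and one for falling edges, and uses a hypothetical `tap` position in place of the multiplexer's clock ratio input. The class name, method names, and history depth are illustrative.

```python
from collections import deque


class EdgeSampler:
    """Behavioral sketch of a flip-flop chain with edge-detect history."""

    def __init__(self, history_depth: int = 8):
        # Flip-flop chain holding the three newest clock samples, newest first.
        self.chain = deque([0, 0, 0], maxlen=3)
        # Shift registers of edge-detect bits, newest first.
        self.rise_history = deque([0] * history_depth, maxlen=history_depth)
        self.fall_history = deque([0] * history_depth, maxlen=history_depth)

    def tick(self, clock_level: int) -> None:
        """Advance one processor clock cycle with a new sample of the delayed clock."""
        self.chain.appendleft(clock_level)
        newer, older = self.chain[0], self.chain[1]
        self.rise_history.appendleft(int(older == 0 and newer == 1))
        self.fall_history.appendleft(int(older == 1 and newer == 0))

    def tap(self, position: int) -> tuple[int, int]:
        """Read (rise, fall) bits from `position` cycles in the past."""
        return self.rise_history[position], self.fall_history[position]
```

With a forwarded clock sampled four times per period, for example, a rising edge seen one cycle ago also appears five cycles ago, which is the periodicity that permits a past edge to stand in for the current one.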
The recovery circuit also includes a finite state machine coupled to each sampling circuit. The state machine identifies when corresponding rising or falling edges have been sampled for each delayed clock signal. The finite state machine comprises input logic that is coupled to the outputs of each sampling circuit. This input logic is configured to indicate if the sampling circuit has detected a rising edge, a falling edge, or no edge in the delayed clock signal. The state machine uses information from the input logic in transitioning between a plurality of states. Each state is reserved for a condition where edges of a certain type and from a certain source are expected. The state machine also includes output logic that is configured to receive signals from the input logic and from the state machine. Transitions between states in the state machine occur when expected edge types are found and generate a pulse that is sent to the output logic. If the signals to the output logic are sufficient to indicate that expected rising edges have been found from all delayed clocks, a rise command is output. Conversely, if the signals to the output logic are sufficient to indicate that expected falling edges have been found from all delayed clocks, a fall command is output.
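The edge-matching role of the finite state machine may be illustrated with the following sketch. It is deliberately reduced: instead of one state per expected edge type and source, it tracks a single expected edge type and a per-clock flag, and it assumes the skew between delayed clocks is less than half a forwarded clock period. The class name and the string encoding of edges and commands are assumptions for this example.

```python
class EdgeAligner:
    """Issues a command once every delayed clock has shown the expected edge."""

    def __init__(self, n_clocks: int = 2):
        self.n_clocks = n_clocks
        self.expected = "rise"              # edge type currently awaited
        self.seen = [False] * n_clocks      # which clocks have shown it

    def step(self, edges):
        """Process one processor cycle.

        `edges` holds, per delayed clock, "rise", "fall", or None (no edge).
        Returns "rise" or "fall" when all clocks have produced the expected
        edge, otherwise None.
        """
        for i, edge in enumerate(edges):
            if edge == self.expected:
                self.seen[i] = True
        if all(self.seen):
            command = self.expected
            self.expected = "fall" if command == "rise" else "rise"
            self.seen = [False] * self.n_clocks
            return command
        return None
```

In this sketch a command is issued in the cycle in which the last outstanding clock reports its edge, so a skewed pair of clocks delays the command rather than producing a mismatched one.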
Lastly, the recovery circuit includes data capture logic configured to sample and hold the data bits on each conduction path in the communications link. When the finite state machine issues a command indicating that all falling or rising edges have been found, the data capture logic delivers the data as a complete packet to the processor clock domain at the subsequent processor clock edge. The data capture logic comprises rising and falling capture latches operating at the delayed clock frequency. These latches sample data from each conduction path on rising and falling edges of the delayed clock signal, respectively. The data capture logic also includes a multiplexer configured to select between the output of the rising and falling capture latches. Selection logic is used to detect the rise and fall commands from the finite state machine. When the finite state machine issues a rise command, the selection logic delivers a signal to the multiplexer to select data that is sampled by the rising capture latch. Conversely, when the finite state machine issues a fall command, the selection logic delivers a signal to the multiplexer to select data that is sampled by the falling capture latch. The output of each multiplexer is delivered to a recovery latch operating at the processor clock frequency. The recovery latches are enabled only when a rise or a fall command is issued by the finite state machine. Once enabled by the rise or fall commands from the finite state machine, the recovery latches pull the data from the multiplexers into the processor clock domain at the next appropriate clock cycle. In this manner, the original data packet is successfully transmitted across the data communications link and synchronized into the appropriate clock domain.
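A behavioral sketch of the data capture path for one bundle follows. It models the rising and falling capture latches, the multiplexer selection, and a recovery latch that updates only when a command is present. The class and method names are hypothetical, data is represented as an integer per bundle rather than per conduction path, and latch timing is abstracted into method calls.

```python
class DataCapture:
    """Rising/falling capture latches feeding a command-enabled recovery latch."""

    def __init__(self):
        self.rise_latch = 0          # data sampled on rising delayed-clock edges
        self.fall_latch = 0          # data sampled on falling delayed-clock edges
        self.recovery_latch = None   # data held in the processor clock domain

    def on_delayed_clock_edge(self, edge: str, data: int) -> None:
        """Capture bundle data on an edge of the delayed forwarded clock."""
        if edge == "rise":
            self.rise_latch = data
        elif edge == "fall":
            self.fall_latch = data
        else:
            raise ValueError("edge must be 'rise' or 'fall'")

    def on_processor_clock(self, command):
        """Update the recovery latch only when a rise or fall command is given."""
        if command == "rise":
            self.recovery_latch = self.rise_latch   # multiplexer selects rising latch
        elif command == "fall":
            self.recovery_latch = self.fall_latch   # multiplexer selects falling latch
        return self.recovery_latch
```

Because the recovery latch ignores cycles without a command, newly captured data does not disturb the value already delivered to the processor clock domain until the state machine indicates that the next packet is complete.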