The present invention relates to data processing systems, and, in particular, to data processing systems that have at least two clock domains between which data items pass.
Computer communication networks are normally constructed using switch ASICs (Application Specific Integrated Circuits). Switch ASICs come in a variety of types and sizes, but larger networks usually require a number of switch ASICs that are connected together to form a multi-stage network. The performance of a network can be measured with a large number of parameters, including bandwidth, latency, addressing, standards compliance and many more.
Reducing message latency is becoming more important as the bandwidth of communication links and the performance of microprocessors increase. Message latency is the time it takes for a communication to take place. For large amounts of data, the bandwidth of the communication link dominates. For small messages, the bandwidth is less important and the final latency value is instead dominated by the time it takes for data to travel along a cable and to cross each of the switching elements and the adapters interfacing to the computers at each end.
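The trade-off described above can be illustrated with a simple first-order model in which latency is the serialisation time plus a fixed path delay. This is a minimal sketch only: the function name and every numeric figure below are assumptions chosen for illustration and are not taken from any particular system.

```python
def message_latency_ns(size_bits, bandwidth_gbps, fixed_delay_ns):
    """Total latency = serialisation time + fixed path delay.

    With bandwidth in Gbit/s, size_bits / bandwidth_gbps is already in ns.
    """
    return size_bits / bandwidth_gbps + fixed_delay_ns


# Hypothetical figures, for illustration only.
FIXED_NS = 500.0   # cable + switching elements + adapters (assumed)
LINK_GBPS = 10.0   # link bandwidth (assumed)

small = message_latency_ns(64 * 8, LINK_GBPS, FIXED_NS)         # 64-byte message
large = message_latency_ns(1_000_000 * 8, LINK_GBPS, FIXED_NS)  # 1 MB message

# Fraction of the total latency that is due to the fixed path delay.
small_fixed_share = FIXED_NS / small
large_fixed_share = FIXED_NS / large
```

Under these assumed figures the fixed delay accounts for most of the small-message latency and a negligible part of the large-message latency, which is why reducing per-stage delay matters most for small messages.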
The cable delay can be minimised by using high quality copper cable with low relative permittivity dielectric insulators. There is less scope for improvement with glass fibre optic cables other than to reduce the length of the cable.
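The dependence of cable delay on the insulator can be sketched from the standard relation that the signal velocity in a cable is the speed of light divided by the square root of the relative permittivity of the dielectric, assuming a non-magnetic insulator. The function name is hypothetical and the two permittivity values are assumed, illustrative figures rather than data for any specific cable.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def delay_ns_per_metre(relative_permittivity):
    """Propagation delay per metre, assuming relative permeability of 1."""
    velocity = SPEED_OF_LIGHT_M_PER_S / math.sqrt(relative_permittivity)
    return 1e9 / velocity


# Assumed, illustrative dielectric constants for two insulator materials.
lower_permittivity_delay = delay_ns_per_metre(2.1)
higher_permittivity_delay = delay_ns_per_metre(2.3)

# Delay saved over a hypothetical 10 m cable by choosing the lower value.
saving_ns = 10 * (higher_permittivity_delay - lower_permittivity_delay)
```

The saving per metre is small, so the benefit only becomes significant when the remaining sources of latency have already been reduced.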
High performance Serializer/Deserializers (SerDes) are used to interface functional blocks on a switch ASIC to either a copper or fibre cable. They convert parallel data on the ASIC into a high frequency serial bit stream at the transmitting end, and take the weak signal available at the other end of the wire and convert it back to a parallel received data value. High frequency locally generated clocks are required to perform this function. The clocks used to transmit and receive the data in the SerDes are usually different from the main clock used to perform the function of the switching element or adapter connected to the communication link. They often run at a different frequency and can often be completely asynchronous with respect to the main clock. This is very common for the receive clock, as the phase relationship between the incoming data and the local clock, the delays in the logic and the length of the cable are usually unknown or not predictable. It is often convenient for the transmit clock and the system clock to be only loosely connected, as this can significantly simplify the system design at the ASIC level.
Synchronisation between clock domains is possible using sampling flip-flops such as that shown in FIG. 1 of the accompanying drawings. A flip-flop is a single bit of register state. A good sampling flip-flop requires slightly different properties from a normal flip-flop. Flip-flops have a clock input (CK), a data input (D) and a data output (Q). Sometimes additional test circuitry is included, and this usually takes the form of an additional input multiplexor that allows many flip-flops to be connected into a long shift register. This simplifies the process of inputting test data and outputting test results. Normally flip-flops are optimised to reduce the maximum delay from the input D to the Q output. The D input is usually sampled on the rising edge of the clock CK pin, and the tighter the setup and hold window the better. Flip-flops 1 are usually constructed from two D type latches 2 and 3 placed one after the other as shown in FIG. 1. The first D type latch 2 is transparent when the clock input CK is low and the second 3 is transparent when the clock input CK is high. This has the effect of sampling on the rising edge of the clock. While a D type latch is not sampling the input, the circuit must have a way to remember the previously sampled value. This is normally done by feeding the output value back to the input. Sometimes this is done with a weak feedback inverter, as it only has to hold the electrical charge loaded when the D input was being sampled. Without the weak inverter the charge could leak away and the stored value could be lost. FIG. 2 shows one circuit 4 for a CMOS D type latch that has a clocked feedback value onto a storage node 5. This allows a stronger inverter to be used to conditionally load the output onto the storage node when the D latch needs to remember the value that had been loaded.
A sampling flip-flop can try to load a value at the same time the value is changing. This would produce a timing violation in normal logic using normal flip-flops. In order to counter this problem, normal flip-flops define a setup and hold period around the rising edge of the clock during which the D input signal should be settled with a solid logic 0 or 1 value. If the setup/hold window is honoured the behaviour of the flip-flop is completely predictable. The behaviour of a sampling flip-flop is not predictable if the input is changing on the rising edge of the clock. The output could read one value and then change to another value some time after the rising edge of the clock. The flip-flop can be described as being metastable during this uncertain time. Like a carefully balanced inverted pendulum it could fall one way or the other. The more carefully it is balanced the longer it will hover in the inverted position before falling in one direction. Eventually it will decide but theoretically it could be undecided for an indefinite time. The probability of being undecided quickly becomes vanishingly small but there is always a finite possibility of being undecided.
It is not possible to prevent metastability but the chances of being affected by it can be reduced in two main ways.
1. Sampling flip-flops should always have a very strong conditionally loaded feedback value. The higher the loop gain, while the flip-flop is not sampling the input, the better. The loop gain can be further improved by minimising the output Q load with a small buffer and minimising the capacitive load from the loading transistors onto the storage node. This will encourage the flip-flop to come to a decision more quickly when the clock is in the hold level. Using the inverted pendulum metaphor this is equivalent to a stronger gravitational pull.
2. The other way is to increase the amount of time the flip-flop has to come to a decision. The probability of failure falls exponentially with the time allowed for the flip-flop to settle.
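The exponential dependence on settling time can be illustrated with the first-order model commonly used for synchroniser reliability, in which the mean time between failures grows as the exponential of the settling time divided by a resolution time constant. This model is not taken from the present text, and every parameter value below is a hypothetical figure chosen only to make the sketch runnable.

```python
import math


def synchroniser_mtbf_s(settle_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """First-order mean time between synchronisation failures, in seconds.

    settle_s  : time allowed for a metastable state to resolve
    tau_s     : resolution time constant (shorter with higher loop gain)
    window_s  : width of the vulnerable window around the sampling edge
    f_clk_hz  : sampling clock frequency
    f_data_hz : rate at which the asynchronous input changes
    """
    return math.exp(settle_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


# Hypothetical device and clock parameters (assumptions, not measured data).
TAU = 50e-12
WINDOW = 20e-12
F_CLK = 500e6
F_DATA = 100e6
CYCLE = 1.0 / F_CLK

one_cycle = synchroniser_mtbf_s(1 * CYCLE, TAU, WINDOW, F_CLK, F_DATA)
two_cycles = synchroniser_mtbf_s(2 * CYCLE, TAU, WINDOW, F_CLK, F_DATA)

# Each extra cycle of settling time multiplies the MTBF by exp(CYCLE / TAU).
improvement = two_cycles / one_cycle
```

In this model a smaller time constant corresponds to the first approach listed above (higher loop gain), and a larger settling time corresponds to the second.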
The time available for a sampling flip-flop to make a decision on a silicon device usually relates to the clock cycle used on the ASIC. Often this is not long enough for the probability of failure to be small enough for failure during the lifetime of the product to be highly unlikely. Synchronising flip-flops can be pipelined, effectively increasing the settling time by a whole cycle for each flip-flop added in the pipe. FIG. 3 gives an example of a pipelined synchronisation scheme where two whole cycles are available to allow a metastable state to drop out into a normal state. This technique is very successful and is used in many designs to give reliable operation on asynchronous internal interfaces. The phase relationship of the two clocks at the interface can be measured using pipelined synchronising flip-flops, often sampling Gray coded counters. Usually the data is passed through a short FIFO that is loaded in one clock domain and read in the other clock domain.
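Gray coded counters are suited to being sampled across a clock boundary because successive values differ in exactly one bit, so a sample taken while the counter is changing resolves to either the old value or the new value and never to an unrelated one. The following is a minimal sketch of the standard binary-reflected Gray code; the function names are illustrative.

```python
def to_gray(value):
    """Convert a binary count to its binary-reflected Gray code."""
    return value ^ (value >> 1)


def from_gray(gray):
    """Convert a Gray code back to the binary count."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def bits_changed(a, b):
    """Number of bit positions in which two values differ."""
    return bin(a ^ b).count("1")


# Binary 7 -> 8 flips four bits (0111 -> 1000); the Gray coded
# equivalents 0100 -> 1100 flip only one.
binary_flips = bits_changed(7, 8)
gray_flips = bits_changed(to_gray(7), to_gray(8))
```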
FIG. 4 gives an example phase aligning FIFO 6 constructed from sixteen registers. In the example, data is written 7 into the FIFO 6 using clock A. Each new value is written into the next entry as shown by the “Writing With Clock A” arrow. After the last register entry is written, the write pointer will wrap back to the first entry as shown by the wrapping back pointer. In the example, eight data values D1 to D8 have been written, with D1 the first value to be written. The data is pulled 8 from the FIFO 6 using Clock B, which reads data in the same order it was written. FIG. 4 shows the FIFO 6 being half full, with eight values out of the sixteen entries having valid data. This is the depth that is furthest away from both an underflow, where reading was faster than writing, causing the read pointer to catch up with and overtake the write pointer, and an overflow, where writing was faster than reading.
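The pointer behaviour described for FIG. 4 can be modelled in software as a circular buffer. The sketch below is a single-threaded behavioural model only: it does not model the two clocks or any synchronisation, and the class and method names are hypothetical.

```python
class RingFifo:
    """Behavioural model of a FIFO built from a fixed set of registers."""

    def __init__(self, depth=16):
        self.depth = depth
        self.entries = [None] * depth
        # Free-running counts; the register index is the count modulo depth,
        # which gives the wrap back to the first entry.
        self.write_count = 0
        self.read_count = 0

    def occupancy(self):
        return self.write_count - self.read_count

    def write(self, value):
        if self.occupancy() == self.depth:
            raise OverflowError("overflow: writing was faster than reading")
        self.entries[self.write_count % self.depth] = value
        self.write_count += 1

    def read(self):
        if self.occupancy() == 0:
            raise IndexError("underflow: reading was faster than writing")
        value = self.entries[self.read_count % self.depth]
        self.read_count += 1
        return value


# Reproduce the half full state described above: D1 to D8 written, none read.
fifo = RingFifo(depth=16)
for i in range(1, 9):
    fifo.write(f"D{i}")

# Margin before either failure: entries that can still be read or written.
margin_to_underflow = fifo.occupancy()
margin_to_overflow = fifo.depth - fifo.occupancy()
```

At half full the two margins are equal, which is the sense in which this depth is furthest from both failure conditions.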
Write and read clocks do not need to have the same frequency in order to avoid underflow or overflow. FIG. 4 shows a FIFO 6 where the amount of data written is the same as the amount of data read but the FIFO 6 can be constructed to allow a different number of bits written on each A clock compared with the amount read on each B clock. This can be a simple multiple but more complex configurations are possible. The FIFO 6 is usually used to realign data from a serial bit stream and multiple bits are usually only written to allow manageable clock frequencies for a given required bandwidth. For example data could be arriving 16 bits wide and being read 33 bits wide. In this example the FIFO 6 could appear as 33 entries each 16 bits wide to the writing clock and 16 entries each 33 bits wide to the reading clock. In this case the write to read clock frequency ratio should be 33:16 if a value is to be written and read on each cycle of the two clocks.
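The arithmetic behind the 16 bit and 33 bit example can be checked with a short calculation. This sketch assumes the storage is sized to the least common multiple of the two widths so that it divides evenly into both views; the function name and the dictionary keys are illustrative.

```python
from fractions import Fraction
from math import gcd


def crossing_geometry(write_bits, read_bits):
    """Entry counts and clock ratio for a FIFO with unequal port widths."""
    # Smallest storage that divides evenly into both port widths (assumed).
    total_bits = write_bits * read_bits // gcd(write_bits, read_bits)
    return {
        "total_bits": total_bits,
        "write_entries": total_bits // write_bits,
        "read_entries": total_bits // read_bits,
        # Equal bandwidth on both sides requires
        # f_write * write_bits == f_read * read_bits.
        "write_to_read_clock_ratio": Fraction(read_bits, write_bits),
    }


geometry = crossing_geometry(write_bits=16, read_bits=33)
```

For these widths the result is 33 write entries, 16 read entries and a write to read clock ratio of 33:16, matching the figures given above.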
There are other ways in which different clock frequencies can be managed. The communication protocol can include mechanisms to allow small variations in clock frequency. Some protocols include a SKIP token, which the receiver can use either to delete an entry, reducing the probability of an overflow if the FIFO is becoming full, or to decline to take a value if the FIFO is becoming close to empty, allowing it to gain an extra entry.
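The receiver's choice when a SKIP token is seen can be sketched as a simple threshold decision. The thresholds, the function name and the action labels below are hypothetical; real protocols define their own rules for when a SKIP token may be deleted or held.

```python
def skip_action(occupancy, low_mark, high_mark):
    """Decide what the receiver does with a SKIP token.

    "delete" : drop the SKIP entry so the FIFO does not grow
    "hold"   : do not take a value this cycle, so the FIFO gains an entry
    "pass"   : occupancy is comfortable, so treat the SKIP as a normal entry
    """
    if occupancy >= high_mark:
        return "delete"
    if occupancy <= low_mark:
        return "hold"
    return "pass"


# Hypothetical thresholds for a sixteen entry FIFO.
LOW_MARK = 2
HIGH_MARK = 14

actions = [skip_action(level, LOW_MARK, HIGH_MARK) for level in (1, 8, 15)]
```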
Another commonly used method is to use a faster clock for processing the data than for transmitting or receiving the data. A receiving FIFO will always remove valid data and the transmitting FIFO will always ensure there is enough data written to the FIFO to guarantee the reading clock has valid data to send.
Any data sitting in a FIFO is increasing the message latency. Some designs are not very concerned with the value of the latency and these usually choose to keep the realignment FIFOs approximately half full. However, for latency critical designs, the FIFOs should be kept as near empty as possible as shown in FIG. 5 while still guaranteeing the FIFO never underflows. Sampling flip-flops must have time to settle and they can easily need 2, 3 or more cycles to allow the sampling flip-flops enough time to remove their metastable state. For very low latency communications using multiple stages of switching elements this additional delay can affect the performance of the whole system. With high performance SerDes this additional delay can be seen on both the RX path and the TX path doubling the penalty.
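The cumulative effect of the synchroniser delay across a multi-stage network can be estimated with simple arithmetic. The sketch assumes the same number of settling cycles and the same clock period on every crossing, and all figures used are hypothetical.

```python
def synchroniser_penalty_ns(stages, sync_cycles, clock_period_ns, crossings_per_stage=2):
    """Latency added by synchronisers along a path through the network.

    crossings_per_stage defaults to 2 to count both the RX path and
    the TX path at each switching element.
    """
    return stages * crossings_per_stage * sync_cycles * clock_period_ns


# Hypothetical path: five switching elements, three settling cycles per
# crossing, and a 2 ns clock period.
penalty_ns = synchroniser_penalty_ns(stages=5, sync_cycles=3, clock_period_ns=2.0)
```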
A commonly used circuit to safely move data through a FIFO 20 from one asynchronous clock domain 22 into another 24 is shown in FIG. 6. The circuit passes a Gray encoded copy 26 of a write pointer 28 into the read clock domain 24, and a Gray encoded copy 30 of the read pointer 32 into the write clock domain 22. Synchronising flip-flops 34 must be used on the Gray encoded pointer values passed into the new clock domain to prevent metastability problems. Once synchronised into the new clock domain, the Gray values are converted 36,38 back to binary values and compared 40,42 to the local pointer value to determine whether the FIFO 20 is full or empty. On the read side of the FIFO 20, an entry can safely be read when the FIFO 20 is not empty, and on the write side of the FIFO 20, an entry can safely be written if the FIFO 20 is not full.
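The full and empty comparisons can be sketched as follows. This is a behavioural model only and makes one assumption not stated above: each pointer carries one extra wrap bit beyond the bits needed to index the FIFO, a common way of distinguishing a full FIFO from an empty one. The synchronising flip-flops are not modelled, and all names are illustrative.

```python
DEPTH = 16                 # number of FIFO entries
PTR_MASK = 2 * DEPTH - 1   # pointer width includes one extra wrap bit (assumed)


def to_gray(value):
    return value ^ (value >> 1)


def from_gray(gray):
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def is_empty(local_read_ptr, synced_write_gray):
    """Read side: empty when the synchronised write pointer equals ours."""
    return from_gray(synced_write_gray) == local_read_ptr


def is_full(local_write_ptr, synced_read_gray):
    """Write side: full when we are exactly DEPTH entries ahead of the reader."""
    return (local_write_ptr - from_gray(synced_read_gray)) & PTR_MASK == DEPTH


# The write pointer has wrapped past the end of its range (35 -> 3),
# while the read pointer is still at 19: sixteen entries are outstanding.
write_ptr = 35 & PTR_MASK
read_ptr = 19
full_now = is_full(write_ptr, to_gray(read_ptr))
```

Because a synchronised pointer is always a slightly old copy, the read side can only see fewer entries than are really present and the write side can only see fewer free entries than are really available. Both errors are safe, and both add latency.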
This method is inherently safe as there is a delay of at least three clock cycles for the passing of the pointer value from one clock domain to the other due to the synchronisers. The cost of this safety is additional latency due to the delay through the synchronisers.
In many communications ASICs, the main system logic is operated at a higher frequency than the communications links. This allows for additional packet processing operations to be performed and a side effect is that it also permits a simplification of the clock domain crossing FIFOs between the system and link clock domains.
For data passing from the link clock domain into the system clock domain, there is a guarantee that the data can be read from the FIFO at a faster rate than it is written, thus ensuring that the FIFO never overflows. There is therefore no need to pass the read pointer into the write clock domain, as the logic writing into the FIFO can assume that the FIFO is never full. The write pointer in the link clock domain is passed into the system clock domain to allow a read to be made as soon as data is available. This is illustrated in FIG. 7. This implementation is a simplified case of the generic clock domain crossing FIFO shown in FIG. 6 and still suffers from the problem that there is a delay due to the synchronising flip-flops, which causes data to remain in the FIFO for several cycles before it can be read out.
For data passing from the system clock domain into the link clock domain, the system must ensure that there is always data to read by the link clock domain but that the FIFO does not overflow. To achieve this, the read pointer is passed from the link clock domain into the system clock domain and the logic in the system clock domain can write data whenever the FIFO is not full. This is illustrated in FIG. 8. Again, this is purely a simplification of a generic clock domain crossing FIFO shown in FIG. 6, and suffers from excessive latency due to the FIFO being kept relatively full.
The implementations discussed in the prior art section are safe methods for crossing clock domains but suffer from excessive latency due to the time taken for the pointer values to pass through the synchronising flip-flops.