In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises one or more central processing units (CPUs) and supporting hardware necessary to store, retrieve and transfer information, such as communication buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication links coupled to a network, etc. CPUs (also called processors) are capable of performing a limited set of very simple operations, but each operation is performed very quickly. Data is moved between processors and memory, and between input/output devices and processors or memory. Sophisticated software at multiple levels directs a computer to perform massive numbers of these simple operations, enabling the computer to perform complex tasks, and providing the illusion at a higher level that the computer is doing something sophisticated.
Continuing improvements to computer systems can take many forms, but the essential ingredient of progress in the data processing arts is increased throughput, i.e., performing more of these simple operations per unit of time.
The computer is a sequential state machine in which signals propagate through state storing elements synchronized with one or more clocks. Conceptually, the simplest possible throughput improvement is to increase the speeds at which these clocks operate, causing all actions to be performed correspondingly faster.
Data must often be communicated across boundaries between different system components. For example, data may need to be communicated from one integrated circuit chip to another. In countless instances, an operation to be performed by a component cannot be completed until data is received from some other component. The capacity to transfer data can therefore be a significant limitation on the overall throughput of the computer system. As the various components of a computer system have become faster and handle larger volumes of data, it has become necessary to correspondingly increase the data transferring capability (“bandwidth”) of the various communications paths.
Typically, a communications medium or “bus” for transferring data from one integrated circuit chip to another includes multiple parallel lines which carry data at a frequency corresponding to a bus clock signal, which may be generated by the transmitting chip, the receiving chip, or some third component. The multiple lines in parallel each carry a respective part of a logical data unit. For example, if eight lines carry data in parallel, a first line may carry a first bit of each successive 8-bit byte of data, a second line may carry a second bit, and so forth. Thus, the signals from a single line in isolation are meaningless, and must be combined with those of the other lines to produce coherent data.
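The eight-line example above can be illustrated with a minimal sketch. The function names `stripe` and `combine`, the constant `NUM_LINES`, and the choice of placing the least significant bit on the first line are assumptions made for illustration only; they are not taken from any particular bus design.

```python
from typing import List

NUM_LINES = 8  # assumed bus width, matching the eight-line example


def stripe(data: bytes) -> List[List[int]]:
    """Split bytes into NUM_LINES bit streams; line i carries bit i of each byte."""
    lines: List[List[int]] = [[] for _ in range(NUM_LINES)]
    for byte in data:
        for i in range(NUM_LINES):
            lines[i].append((byte >> i) & 1)
    return lines


def combine(lines: List[List[int]]) -> bytes:
    """Reassemble bytes by taking one bit from every line on each bus beat."""
    out = bytearray()
    for beat in zip(*lines):
        byte = 0
        for i, bit in enumerate(beat):
            byte |= bit << i
        out.append(byte)
    return bytes(out)
```

In this sketch, any single list returned by `stripe` is only a stream of isolated bits; the original bytes are recovered only when `combine` reads all lines on the same beat.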
The increased clock frequencies of processors and other digital data components have induced designers to increase the speeds of bus clocks in order to prevent transmission buses from becoming a bottleneck to performance. This has caused various design changes to the buses themselves. For example, a high-speed bus is typically implemented as a point-to-point link containing multiple lines in parallel, each carrying data from a single transmitting chip to a single receiving chip, in order to support operation at higher bus clock speeds.
The geometry, design constraints, and manufacturing tolerances of integrated circuit chips and the circuit cards or other platforms on which they are mounted make it impossible to guarantee that all lines of a single link are identical. For example, it is sometimes necessary for a link to turn a corner, meaning that the lines on the outside edge of the corner will be physically longer than those on the inside edge. Circuitry on a circuit card is often arranged in layers; some lines may lie adjacent to different circuit structures in neighboring layers, which can affect stray capacitance in the lines. Any of numerous variations during manufacture may cause some lines to be narrower than others, closer to adjacent circuit layers, etc. These and other variations affect the time it takes a signal to propagate from the transmitting chip to the receiving chip, so that some data signals carried on some lines will arrive in the receiving chip before others (a phenomenon referred to as data skew). Furthermore, manufacturing variations in the transmitter driving circuitry in the transmitting chip or receiving circuitry in the receiving chip can affect the quality of the data signal.
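The effect of data skew on a receiver can be sketched in simplified form. The model below assumes, purely for illustration, that each line's delay is a known whole number of bus beats; real skew is an analog quantity and is generally not an integer number of beats. The names `arrive` and `deskew` are hypothetical.

```python
from typing import List


def arrive(lines: List[List[int]], delays: List[int], idle: int = 0) -> List[List[int]]:
    """Model skew: each line's bit stream arrives late by its own number of beats."""
    longest = max(delays)
    return [[idle] * d + bits + [idle] * (longest - d)
            for bits, d in zip(lines, delays)]


def deskew(received: List[List[int]], delays: List[int]) -> List[List[int]]:
    """Discard each line's known delay so that all lines realign on the same beat."""
    payload = len(received[0]) - max(delays)
    return [bits[d:d + payload] for bits, d in zip(received, delays)]
```

If the receiver in this model samples all lines on the same beat without compensation, bits belonging to different data units are combined; applying a per-line delay correction restores the alignment.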
In order to support data transfer at high bus clock speeds, the lines of a data communications bus can be individually calibrated to compensate for these and other variations. However, so sensitive is the communications mechanism in many modern data processing environments that calibration parameters can drift significantly during operation, so that periodic re-calibration is required to achieve acceptable performance.
Modern data processing systems are expected to provide a high degree of availability, and interruption of data processing function to perform system maintenance is increasingly unacceptable. Accordingly, various techniques exist whereby a data communications bus can be periodically re-calibrated without suspending operation of the bus, i.e., without suspending the transfer of functional data. For example, it is known to provide a duplicate of each individual line and certain associated hardware for use in calibrating the line, so that functional data can be transmitted on the duplicate line while the primary line is being calibrated. It is also known to provide a common redundant line for use in calibration, the individual lines being calibrated one at a time, while the common redundant line compensates for the lost data capacity of the line being calibrated.
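The common redundant line arrangement can be sketched as follows. The sketch assumes one particular assignment, in which N logical bit positions are spread in order over the N + 1 physical lines that are not currently being calibrated; other assignments are possible, and the names `lane_map` and `calibration_schedule` are hypothetical.

```python
from typing import Iterator, List, Tuple


def lane_map(num_logical: int, calibrating: int) -> List[int]:
    """Return the physical line carrying each logical bit while one line is calibrated."""
    if not 0 <= calibrating <= num_logical:
        raise ValueError("calibrating line out of range")
    # There are num_logical + 1 physical lines; skip the one being calibrated.
    return [p for p in range(num_logical + 1) if p != calibrating]


def calibration_schedule(num_logical: int) -> Iterator[Tuple[int, List[int]]]:
    """Calibrate the physical lines one at a time, yielding the mapping for each step."""
    for calibrating in range(num_logical + 1):
        yield calibrating, lane_map(num_logical, calibrating)
```

For example, with four logical bits and five physical lines, `lane_map(4, 2)` assigns the logical bits to physical lines 0, 1, 3 and 4 while line 2 is calibrated, so the full data width remains available throughout.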
As bus speeds and other design parameters become increasingly demanding, it becomes more and more difficult to ensure that all lines of a communications bus can function properly at all times, even with periodic dynamic re-calibration of the lines.
In general, it is known to provide redundant components of a data processing system, for use in the event of a failure of any single component. However, component failure may still affect system availability and performance during the period in which the failure is identified and a redundant component is brought into operation. In the case of a communications line, a line failure may manifest itself as a sudden and complete failure of the line to transmit intelligible data (often known as a “hard error” or “hard failure”), but often manifests itself instead as an unacceptably high rate of intermittent error (often referred to as “soft error”), which may gradually grow worse over time. A hard failure in a communications line can cause serious system disruption and should be rectified as quickly as possible. High soft error rates may significantly degrade system performance as a result of the need to re-try operations.
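The distinction between hard failure and elevated soft error rate can be illustrated with a simple per-line monitor. This is only a sketch under stated assumptions: the class name `SoftErrorMonitor`, the sliding window of recent transfers, the threshold value, and the rule that a full window of errors counts as a hard failure are all hypothetical choices, not a description of any conventional technique.

```python
from collections import deque


class SoftErrorMonitor:
    """Track recent transfer outcomes on one line over a sliding window (assumed model)."""

    def __init__(self, window: int, threshold: float) -> None:
        self.results = deque(maxlen=window)  # True means the transfer had an error
        self.threshold = threshold           # assumed acceptable soft error rate

    def record(self, had_error: bool) -> None:
        self.results.append(had_error)

    def error_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def classify(self) -> str:
        full = len(self.results) == self.results.maxlen
        if full and all(self.results):
            return "hard"  # assumption: every recent transfer failed
        if self.error_rate() > self.threshold:
            return "soft"  # intermittent errors above the assumed threshold
        return "ok"
```

In this model a line whose error rate creeps upward is reported as "soft" before it fails outright, which is the point at which re-tried operations begin to degrade performance.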
In order to support continuing increases in communications bus speeds and improved system reliability and availability, a need exists for improved techniques for detecting and responding to communications bus component faults. In particular, it would be desirable to detect an increase or a potential increase in soft error rate in a communications line early, ideally before the soft error rate becomes sufficiently high to significantly affect performance. It would further be desirable to bring a replacement line to an operational state sooner than is typical using conventional techniques, in order to avoid or minimize system disruption resulting from hard failures or high soft error rates.