1. Field of Invention
The present invention relates in general to the digital data processing field. More particularly, the present invention relates to testing, characterizing and monitoring chip interfaces for communicating data between integrated circuit chips.
2. Background Art
In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications buses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore, continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the throughput) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor(s). For example, if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant speed improvements by shrinking component size, reducing component number, and eventually, packaging the entire processor as an integrated circuit on a single chip. The reduced size made it possible to increase clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrating circuitry, the demand for ever faster computer systems has continued. Hardware designers have been able to obtain still further improvements in speed by greater integration (i.e., increasing the number of circuits packed onto a single chip), by further reducing the size of circuits, and by various other techniques. For example, it has been possible to increase the width of various data buses to transfer more data with each bus cycle. This is true for serial links as well. Additional improvements have been made possible by increasing parallelism, and specifically, by employing multiple processors. The modest cost of individual processors packaged on integrated circuit chips has made multi-processor systems practical, although such multiple processors add more layers of complexity to a system.
These various development trends in the design of computers and other digital data processing devices have had the effect of increasing the number of I/O pins on integrated circuit chips, and consequently, of increasing the number of data signal lines which communicate data from one chip to another over buses and links.
Inter-chip communication lines are often arranged as buses or links having a defined protocol associated with a clock signal. To maximize overall system throughput, a fast clock is desirable, and designers are often pushing the envelope to obtain the fastest possible clock speed that the hardware will allow.
Where numerous inter-chip communication lines connect multiple ports on multiple chips, all simultaneously communicating data, various factors can degrade the signals received by a receiving chip. This degradation is not uniform. Each line has a different physical location on a circuit card, cable, or other carrier, lies adjacent to a unique set of signal lines, and has a unique driver circuit. It is therefore inevitable that some lines will exhibit greater signal degradation, and will be more prone to soft (i.e., intermittent, non-repeatable) errors, than others.
In general, signal degradation may take the form of a phase shift, a voltage shift, or a purely random signal variation. A phase shift in the signal received by the receiving chip may be caused, for example, by capacitance in the inter-chip signal line. Since each line has a different physical lay-out, the line capacitance (and the amount of phase shift) will vary with each line. A phase shift may also be caused by a clock skew within the transmitting chip, which again might vary from chip to chip. A voltage shift may be caused, for example, by power variations in the transmitting chip's drivers. Random variations in the signal have many possible causes too numerous to mention, but in general certain conditions make a line more susceptible to variation, such as physical length of the line, the strength of the drivers, etc. It will therefore be expected that some lines exhibit more random variation than others.
At some point, the clock signal regulating a chip-to-chip interface can be made to run slow enough so that variations in phase shift of the signal from line to line will not cause problems for the receiver. However, slowing the clock signal to accommodate the worst case signal line may adversely affect system throughput. As the number of signal lines increases, the variation of the worst case line is likely to be more extreme, requiring further slowing of the clock.
Historically, interfaces are often characterized and debugged in the development process using special test equipment, which is coupled to the signal lines to observe what is happening. As the number and density of lines increases, connecting test probes to individual lines is increasingly difficult and time consuming. Additionally, any test probe and attached apparatus has some finite impedance, which can distort the signal being characterized. When transmission errors are of an intermittent and infrequent nature, such conventional testing equipment is often inadequate to the task of characterizing the interface.
Recently, some chips have been designed with “elastic” interfaces, in which the timing of the individual signal line receiver circuits can be varied to accommodate the individual variations in the line capacitance and so forth. These elastic interfaces are typically tuned during system design by measuring phase skew of the lines. Tuning elastic interfaces according to these techniques is a time-consuming process, which only promises to become more difficult as the number of inter-chip communication lines increases in future designs. Moreover, such tuning does not necessarily take other signal degrading factors into account, including signal degrading factors that occur in the field when the system is installed and placed in service after having been designed, and does not necessarily obtain optimal results.
More recently, chips have been designed with elastic inter-chip interfaces that can be tuned on an individual line basis to reduce errors by altering threshold detection voltage, signal delay, and/or driver power. In U.S. Pat. No. 6,735,543, for example, a tunable and characterizable interface is disclosed that supports the accurate determination of low error rates on an individual line basis for various tuning parameter settings. In that patent, an inter-chip transmission circuit in a transmitting chip and a complementary receiving circuit in a receiving chip are disclosed that provide the capability to characterize the inter-chip interface by separately generating identical pseudo-random test data at both chips, comparing the data, and recording errors.
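The characterization approach summarized above, in which both chips separately generate identical pseudo-random test data and the receiver compares what arrives against its own copy, can be illustrated in software. The following is only a minimal sketch under stated assumptions: Python's `random.Random` stands in for the hardware pattern generators, and the names `count_line_errors` and `channel` are hypothetical and do not appear in the referenced patent.

```python
import random


def count_line_errors(seed, num_lines, num_beats, channel):
    """Count miscompares per signal line over num_beats transfers.

    Two generators seeded identically model the separate but identical
    pseudo-random sources in the transmitting and receiving chips.
    `channel` is a hypothetical callable that models the interface: it
    takes the list of transmitted bits and returns the received bits.
    """
    tx_gen = random.Random(seed)  # generator in the transmitting chip
    rx_gen = random.Random(seed)  # identical generator in the receiving chip
    errors = [0] * num_lines
    for _ in range(num_beats):
        sent = [tx_gen.getrandbits(1) for _ in range(num_lines)]
        received = channel(sent)
        expected = [rx_gen.getrandbits(1) for _ in range(num_lines)]
        for line, (got, want) in enumerate(zip(received, expected)):
            if got != want:
                errors[line] += 1  # record the error against this line only
    return errors


def ideal_channel(bits):
    """Model of an error-free interface."""
    return list(bits)


def line2_inverted(bits):
    """Model of an interface whose line 2 always delivers the wrong value."""
    out = list(bits)
    out[2] ^= 1
    return out
```

Because the receiver regenerates the expected data locally, no reference copy has to cross the interface under test, and errors can be attributed to individual lines.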
In a multi-node system that uses serial links to communicate and maintain synchronization of tightly coupled processors, such as those used in symmetric multiprocessing (SMP), the amount of data going through the links is large. These links typically connect expansion ports, such as SMP expansion ports and remote expansion enclosure (RXE) expansion ports, using cables that incorporate multiple communication lines. The performance and serviceability of such systems hinge on the ability to maintain low error rates and detect errors and failures in these links. These links are very sensitive to voltage differences between links, cable length, attenuation, impedance matching, etc. Fluctuations in board impedance, voltage regulators within circuit boards, cables and other factors also impact the performance of the link (also referred to herein as “link errors”). These fluctuations and the sensitivity of the links can create problems when designing, manufacturing, installing and using multi-node systems. Moreover, link errors are also impacted by the type of data being sent across the link. For example, link errors vary as a function of sending random data versus pattern data, or sending data associated with different application programs. Consequently, merely checking for errors in the link using one type of data, e.g., using random data, is not a good predictor of the performance of the link for other types of data. For example, if data associated with an application program being run by the system is transmitted across the link instead of random data, link errors could increase or decrease. Similarly, if the application program being run by the system changes, link errors could increase or decrease.
Two conventional techniques that are used to determine the performance of a link are the “ping” test and the construction of a “bathtub curve”. These techniques are used in system design, manufacture, and in the field. The ping test sends large amounts of random data across a link and performance counters in the receiving chip calculate the performance of the link, e.g., an error rate of the link. In the ping test, a linear feedback shift register (LFSR) in the transmitting chip is typically used to generate pseudo-random data. This data is transmitted in a packet having a header, data, trailer, and an appended cyclic redundancy check (CRC). The header contains information such as the type of packet, the length of the packet, etc. The receiver circuit in the receiver chip checks the header to determine the length of the packet, runs through the CRC algorithm, and compares the result with the CRC appended to the trailer. If the comparison results in a match, then the packet was transmitted and received successfully. If the comparison results in a miscompare, then the packet was not transmitted and received successfully and an error is logged. The packets are repeatedly transmitted across the link to determine an error rate.
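The ping test described above can be sketched in software as follows. This is an illustrative model only, not the circuitry of any particular chip. The assumptions are: a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11; CRC-32 as computed by `zlib.crc32`; a 3-byte header holding a packet type and a payload length; and a fixed 2-byte trailer. The names `PACKET_TYPE_TEST`, `TRAILER`, `build_packet`, `check_packet`, `ping_test` and `channel` are hypothetical.

```python
import struct
import zlib

PACKET_TYPE_TEST = 0x01   # hypothetical packet type code
TRAILER = b"\xEE\xEE"     # hypothetical fixed trailer bytes


def lfsr16_bytes(seed, count):
    """Return `count` pseudo-random bytes from a 16-bit Fibonacci LFSR."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    out = bytearray()
    for _ in range(count):
        byte = 0
        for _bit in range(8):
            # XOR of taps 16, 14, 13, 11 (bit offsets 0, 2, 3, 5)
            feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (feedback << 15)
            byte = (byte << 1) | (state & 1)
        out.append(byte)
    return bytes(out)


def build_packet(payload):
    """Transmitter side: header + data + trailer, with the CRC appended."""
    header = struct.pack(">BH", PACKET_TYPE_TEST, len(payload))
    body = header + payload + TRAILER
    return body + struct.pack(">I", zlib.crc32(body))


def check_packet(packet):
    """Receiver side: read the length from the header, recompute the CRC,
    and compare it with the CRC appended after the trailer."""
    if len(packet) < 3:
        return False
    _ptype, length = struct.unpack(">BH", packet[:3])
    body_len = 3 + length + len(TRAILER)
    if len(packet) != body_len + 4:
        return False
    (received_crc,) = struct.unpack(">I", packet[body_len:])
    return zlib.crc32(packet[:body_len]) == received_crc


def ping_test(num_packets, channel, payload_len=32, seed=0xACE1):
    """Send packets through `channel` and return the packet error rate.

    `channel` is a hypothetical callable modeling the link: it takes the
    transmitted bytes and returns the bytes seen by the receiver.
    """
    errors = 0
    for i in range(num_packets):
        packet_seed = ((seed + i) % 0xFFFF) + 1  # always non-zero
        packet = build_packet(lfsr16_bytes(packet_seed, payload_len))
        if not check_packet(channel(packet)):
            errors += 1  # miscompare: log an error
    return errors / num_packets
```

Note that the sketch measures a single operating point: no tuning parameter is varied between packets.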
A problem with the ping test is that it is a point test, i.e., tuning parameters such as signal delay are not varied. Accordingly, the ping test does not provide guardband data (also referred to herein as “margin” data) indicative of the performance of the link with respect to varied parameters which may change over the life of the system. In addition, because the ping test checks for errors in the link using one type of data, i.e., using random data, the ping test is not a good predictor of the performance of the link for other types of data. For example, if data associated with an application program subsequently run by the system is transmitted across the link instead of random data, the error rate of the link could increase or decrease.
The second conventional technique that is used to determine the performance of a link is the construction of a bathtub curve. In essence, this technique is performed by running the ping test multiple times while a signal delay is changed in a step-wise fashion. During a bathtub curve test, the signal delay is shifted in increments and/or decrements using a programmable delay line in the data path within the receiver circuit of the receiver chip. A bathtub curve is constructed by plotting the error rate as a function of signal delay. An important advantage of constructing a bathtub curve is that it provides a “window of guardband” (also referred to herein as “timing margin” and “eye opening”) in the results. A window of guardband is the width of the bathtub curve at a given error rate. However, running the bathtub curve test requires bringing the system down because clocks must be stopped in the chips to read and set data. In addition, because construction of a bathtub curve according to current techniques checks for errors in the link using one type of data, i.e., using random data, the bathtub curve is not a good predictor of the performance of the link for other types of data. For example, if data associated with an application program subsequently run by the system is transmitted across the link instead of random data, the error rate of the link could increase or decrease.
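By way of illustration, the construction of a bathtub curve and the extraction of a window of guardband at a given error rate might be modeled as shown below. This is a sketch under stated assumptions rather than a description of any actual test program: `measure_error_rate` is a hypothetical callable standing in for a ping test run at one delay setting, delays are expressed in arbitrary delay-line steps, and the error model in `example_error_rate` is synthetic.

```python
def sweep_delay(measure_error_rate, delays):
    """Build the bathtub curve: one (delay, error_rate) point per setting."""
    return [(delay, measure_error_rate(delay)) for delay in delays]


def guardband_window(curve, max_error_rate):
    """Return (first_delay, last_delay) of the widest contiguous run of
    delay settings whose error rate is at or below max_error_rate, or
    None if no setting qualifies. The width of the window is
    last_delay - first_delay."""
    best = None
    run_start = None
    for delay, rate in curve:
        if rate <= max_error_rate:
            if run_start is None:
                run_start = delay
            if best is None or delay - run_start > best[1] - best[0]:
                best = (run_start, delay)
        else:
            run_start = None  # the run of passing settings is broken
    return best


def example_error_rate(delay):
    """Synthetic bathtub shape: lowest error rate at delay step 8, rising
    by a factor of 100 per step away from it, capped at 1.0."""
    return min(1.0, 10.0 ** (2 * abs(delay - 8) - 12))


curve = sweep_delay(example_error_rate, range(17))
window = guardband_window(curve, 1e-9)
```

With the synthetic model above, only delay steps 7 through 9 stay at or below an error rate of 1e-9, so `window` is `(7, 9)`, a guardband two steps wide. Tightening the target error rate narrows the window, which is the sense in which the width of the curve expresses timing margin.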
In order to design and produce systems of increased complexity, and in particular, multi-node systems that use links having an increased number of communication lines operating at high clock rates, it would be desirable to provide improved techniques for testing, characterizing and monitoring the communication lines that reflect timing margin while the system is running an application program and transmitting data associated with the application program across the communication lines.