1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to a method of testing hardware components of a data processing system such as a processing unit.
2. Description of the Related Art
The basic structure of a conventional symmetric multi-processor computer system 10 is shown in FIG. 1. Computer system 10 has one or more processing units arranged in one or more processor groups; in the depicted system, there are four processing units 12a, 12b, 12c and 12d in processor group 14. The processing units communicate with other components of system 10 via a system bus 16. System bus 16 is connected to one or more service processors 18a, 18b, a memory controller 30, and various peripheral devices 22. A processor bridge 24 can optionally be used to interconnect additional processor groups. System 10 may also include firmware (not shown) which stores the system's basic input/output logic, and seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).
Memory controller 30 is further connected to a system memory device 20. System memory device 20 (random access memory or RAM) stores program instructions and operand data used by the processing units, in a volatile (temporary) state. Peripherals 22 may be connected to bus 16 via, e.g., a peripheral component interconnect (PCI) local bus using a PCI host bridge. A PCI bridge provides a low latency path through which processing units 12a, 12b, 12c and 12d may access PCI devices mapped anywhere within bus memory or I/O address spaces. The PCI host bridge interconnecting peripherals 22 also provides a high bandwidth path to allow the PCI devices to access RAM 20. Such PCI devices may include a network adapter, a small computer system interface (SCSI) adapter providing interconnection to a permanent storage device (i.e., a hard disk), and an expansion bus bridge such as an industry standard architecture (ISA) expansion bus for connection to input/output (I/O) devices including a keyboard, a graphics adapter connected to a display device, and a graphical pointing device (mouse) for use with the display device. The service processors can alternatively reside in a modified PCI slot which includes a direct memory access (DMA) path.
In a symmetric multi-processor (SMP) computer, all of the processing units 12a, 12b, 12c and 12d are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. As shown with processing unit 12a, each processing unit may include one or more processor cores 26a, 26b which carry out program instructions in order to operate the computer. An exemplary processing unit includes the POWER5™ processor marketed by International Business Machines Corp. which comprises a single integrated circuit (IC) superscalar microprocessor having various execution units, registers, buffers, memories, and other functional units, which are all formed by integrated circuitry. The processor cores may operate according to reduced instruction set computing (RISC) techniques, and may employ both pipelining and out-of-order execution of instructions to further improve the performance of the superscalar architecture.
Each processor core 26a, 26b includes an on-board (L1) cache (typically, separate instruction and data caches) implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory 20. A processing unit can include another cache such as a second level (L2) cache 28 which supports both of the L1 caches that are respectively part of cores 26a and 26b. Additional cache levels may be provided, such as an L3 cache 32 which is accessible via system bus 16. Each cache level, from highest (L1) to lowest (L3), can successively store more information, but at the cost of a longer access penalty. For example, the on-board L1 caches in the processor cores might have a storage capacity of 128 kilobytes of memory, L2 cache 28 might have a storage capacity of 4 megabytes, and L3 cache 32 might have a storage capacity of 32 megabytes. To facilitate repair/replacement of defective processing unit components, each processing unit 12a, 12b, 12c, 12d may be constructed in the form of a replaceable circuit board, pluggable module, or similar field replaceable unit (FRU), which can be easily installed in or swapped out of system 10 in a modular fashion.
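The tradeoff between capacity and access penalty across cache levels can be illustrated with a minimal sketch. The capacities follow the example above; the cycle counts, the `lookup` function, and the `contents` mapping are hypothetical and are chosen only to show that each successive level costs more to reach. They do not describe any particular processor.

```python
KB, MB = 1024, 1024 * 1024

# (name, capacity in bytes, access penalty in cycles).
# Capacities mirror the example in the text; the cycle counts are
# hypothetical placeholders, not measurements of real hardware.
LEVELS = [
    ("L1", 128 * KB, 2),
    ("L2", 4 * MB, 12),
    ("L3", 32 * MB, 80),
]
MEMORY_PENALTY = 300  # hypothetical cost of going to system memory


def lookup(address, contents):
    """Search each cache level in order for an address.

    contents maps a level name to the set of addresses it currently holds.
    Returns (where_found, total_cycles), accumulating the penalty of every
    level that had to be probed along the way.
    """
    cycles = 0
    for name, _capacity, penalty in LEVELS:
        cycles += penalty
        if address in contents.get(name, set()):
            return name, cycles
    return "memory", cycles + MEMORY_PENALTY


if __name__ == "__main__":
    state = {"L1": {0x40}, "L2": {0x80}}
    for addr in (0x40, 0x80, 0xC0):
        print(hex(addr), lookup(addr, state))
```

In this simplified model a miss at every level pays the penalty of all three caches plus system memory, which is the longer step that caching is intended to avoid.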
The control logic for various components of the memory hierarchy may include error correction code (ECC) circuits to handle errors that arise in a cache line. A bit in a given cache block may contain an incorrect value either due to a soft error (caused, for example, by stray radiation or electrostatic discharge) or to a hard error (a defective cell). ECCs can be used to reconstruct the proper data stream. Some ECCs can only be used to detect and correct single-bit errors, i.e., if two or more bits in a particular block are invalid, then the ECC might not be able to determine what the proper data stream should actually be, but at least the failure can be detected. Other ECCs are more sophisticated and even allow detection or correction of multi-bit errors. Multi-bit errors are costly to correct, however, so a common design tradeoff is to halt the machine when double-bit (uncorrectable) errors occur.
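The single-bit-correct, double-bit-detect behavior described above can be sketched with an extended Hamming (8,4) code. This is an illustrative assumption only: the text does not state which ECC a given cache uses, and practical designs protect much wider words. The `encode` and `decode` names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (bit list).

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming parity bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    for bit in code[1:]:
        code[0] ^= bit  # overall parity makes the whole word even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions locates a single flipped bit
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1  # single-bit error (index 0 is the parity bit itself)
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits are wrong.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # inject a single-bit error
    print(decode(word))          # corrected back to the original value
    word[5] ^= 1
    word[2] ^= 1
    word[6] ^= 1                 # inject a double-bit error
    print(decode(word))          # detected but not correctable
```

In a system following the tradeoff noted above, the "uncorrectable" result would correspond to the condition under which the machine is halted.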
When an IC chip is fabricated for a computer component such as a processing unit or cache memory, it can be evaluated using different testing techniques such as a wafer-level test or an automatic built-in self test (ABIST) to determine if there are any defective logic or storage cells. If a chip fails a wafer-level test, the part may be scrapped. If the chip passes the wafer-level test, the ABIST engine can be activated to perform a nonfunctional test. If a defective cell is found, various corrective measures can be taken. The chip may be repairable, for example, by setting a fuse which is indicative of the defective cell and redirects signals to another (redundant) cell. If the defect is not correctable, the part may again be scrapped. Testing is also useful in providing an analysis of the chip, to better understand how to improve the IC design and avoid device failures.
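The test-and-repair sequence described above can be summarized as a simple decision procedure. The following is a sketch under stated assumptions: the `disposition` function, its arguments, and the fuse map representation are hypothetical, and a defect is treated as correctable only while an unused redundant cell remains.

```python
def disposition(passes_wafer_test, defective_cells, spare_cells):
    """Decide what happens to a chip after wafer-level test and ABIST.

    passes_wafer_test: result of the wafer-level test.
    defective_cells:   set of cell identifiers flagged by ABIST.
    spare_cells:       number of redundant cells available for repair.
    Returns (outcome, fuse_map), where fuse_map maps each defective cell
    to the index of the redundant cell that replaces it.
    """
    if not passes_wafer_test:
        return "scrap", {}          # failed before ABIST was ever run
    if not defective_cells:
        return "pass", {}           # ABIST found nothing to repair
    if len(defective_cells) > spare_cells:
        return "scrap", {}          # more defects than redundant cells
    # Set one fuse per defective cell, redirecting it to a redundant cell.
    fuse_map = {cell: spare for spare, cell in enumerate(sorted(defective_cells))}
    return "repaired", fuse_map


if __name__ == "__main__":
    print(disposition(False, set(), 2))
    print(disposition(True, set(), 2))
    print(disposition(True, {17, 4}, 2))
    print(disposition(True, {1, 2, 3}, 2))
```

Note that every branch depends only on nonfunctional test results for a single chip; nothing in this flow exercises the part under operating conditions.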
One problem with the foregoing tests is that they often do not detect marginal defects in a chip that may cause problems only during actual use of the computer system, because the wafer-level and ABIST procedures do not carry out functional testing of the system (as it would operate under general conditions). Those techniques are typically limited to nonfunctional testing of a single chip using test registers and scan procedures such as a level-sensitive scan design (LSSD). It accordingly becomes necessary to carry out costly bench testing with most of the system components installed, in order to accurately generate operational parameters such as bus traffic and chip noise associated with actual use of the system. In particular, complete functional testing of a processing unit requires interconnection with a memory device to generate “real” conditions. It would, therefore, be desirable to devise an improved method for evaluating a component of a data processing system, such as a processing unit, that enables accurate functional testing of the unit without requiring interconnection of the unit with other system devices. It would be further advantageous if the method could utilize existing hardware features in the component so as to reduce or minimize any additional overhead.