1. Field of the Invention
The present invention relates to diagnostic testing in multiprocessor computer systems and, more particularly, to an off-line multiprocessor cache coherency test for identifying CPU failures that result in cache coherency errors.
2. State of the Art
The complex problem of maintaining cache coherency in a peer-to-peer multiprocessor computer system has been thoroughly studied, and various mechanisms for maintaining cache coherency have been proposed. Briefly stated, cache coherency requires that a read operation by any CPU in the system to a particular memory location return the latest value written to that memory location by any CPU in the system. An example of a known multiprocessor system is shown in FIGS. 1 and 2. Referring to FIG. 1, multiple CPUs, N in number, are connected to a common bus (the I-BUS) and share a common main memory, also connected to the I-BUS. Each of the CPUs is therefore able to access main memory via the I-BUS. In addition, the CPUs are able to communicate among themselves across the I-BUS.
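The coherency requirement stated above (every read returns the latest value written to that location by any CPU) can be illustrated as a check over a recorded trace of memory operations. The following is only a minimal sketch: the event layout, the function name find_coherency_violations and the assumption that the trace lists operations in a single global order are all hypothetical and are not taken from the system of FIGS. 1 and 2.

```python
from typing import Dict, List, Tuple

# Hypothetical event layout: (cpu_id, op, address, value),
# where op is "W" for a write and "R" for a read.
Event = Tuple[int, str, int, int]


def find_coherency_violations(trace: List[Event]) -> List[Event]:
    """Return every read whose value differs from the latest write to its address.

    Assumes the trace is already in one global order; reads of addresses
    that were never written in the trace are not checked.
    """
    latest: Dict[int, int] = {}  # address -> most recently written value
    violations: List[Event] = []
    for event in trace:
        _cpu, op, addr, value = event
        if op == "W":
            latest[addr] = value
        elif op == "R" and addr in latest and latest[addr] != value:
            violations.append(event)
    return violations


trace = [
    (0, "W", 0x100, 7),  # CPU 0 writes 7
    (1, "R", 0x100, 7),  # CPU 1 sees the latest value: coherent
    (1, "W", 0x100, 9),  # CPU 1 writes 9
    (0, "R", 0x100, 7),  # CPU 0 still sees 7: stale read
]
violations = find_coherency_violations(trace)
```

Note that the sketch reports the CPU that observed the stale value, which is not necessarily the CPU that caused it.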
As seen in FIG. 2, each CPU is provided with a two-level cache including a secondary cache and a primary cache. The secondary cache is connected to the I-BUS so as to receive groups of data words (cache lines) from main memory or from other CPUs. Each CPU also includes a bank of registers and a PROM that serves as a control store. The register bank of each CPU is accessible by other CPUs across the I-BUS. The processor (in an exemplary embodiment, a MIPS R3000) is connected to the secondary cache, the register bank and PROM by an internal bus (the C-BUS). The processor is connected to the primary cache by a separate bus. The primary cache has an instruction cache (I-CACHE) portion and a data cache (D-CACHE) portion, not shown.
In operation, when the multiprocessor system is first brought up, none of the caches contains valid data. As data is retrieved from main memory, it is cached in the secondary cache. Data used with some level of frequency will remain in the secondary cache. Data used even more frequently is stored in the primary cache. Multiple copies of the same piece of data may reside in different caches. To ensure that only the most recently updated copy is used (cache coherency), status bits stored in each cache indicate the status of each cache line. For example, a cache line may be indicated to be invalid, representing that data contained in the cache line has been updated by another CPU since the time the line was cached. The cache line may be indicated as having been modified, representing that the most recent update to the data was by the cache's own processor, but that the data has not yet been written back to main memory. The cache line may be indicated as being shared, representing that the data is valid but that at least one other CPU currently has a copy of the data. Finally, the cache line may be indicated as being exclusive, in which case the data is unmodified and no other CPU currently has a copy of the data. Each of the CPUs continually monitors the I-BUS (a practice called "snooping") to determine which memory locations have been accessed by the other CPUs. Cache states of the cache lines in each cache are changed as appropriate in accordance with known cache coherency mechanisms.
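The four cache-line states described above (invalid, modified, shared, exclusive) and the effect of snooping can be sketched as a small state machine. The transitions below follow the commonly taught form of such a four-state protocol and are an assumption for illustration; the text does not specify the exact transitions used by the system of FIG. 2, and the names LineState, on_local_read, on_local_write, on_snooped_read and on_snooped_write are hypothetical.

```python
from enum import Enum


class LineState(Enum):
    INVALID = "I"    # data was updated elsewhere since the line was cached
    SHARED = "S"     # valid; at least one other CPU also holds a copy
    EXCLUSIVE = "E"  # valid, unmodified; no other CPU holds a copy
    MODIFIED = "M"   # updated by this CPU; not yet written back to memory


def on_local_read(state: LineState, others_have_copy: bool) -> LineState:
    """State after this CPU's own processor reads the line."""
    if state is LineState.INVALID:
        # A miss fetches the line; the result depends on whether
        # another cache reports holding a copy.
        return LineState.SHARED if others_have_copy else LineState.EXCLUSIVE
    return state  # a hit leaves the state unchanged


def on_local_write(state: LineState) -> LineState:
    """State after this CPU's own processor writes the line."""
    return LineState.MODIFIED


def on_snooped_read(state: LineState) -> LineState:
    """State after snooping another CPU's read of the same line."""
    if state in (LineState.EXCLUSIVE, LineState.MODIFIED):
        return LineState.SHARED  # another CPU now holds a copy as well
    return state


def on_snooped_write(state: LineState) -> LineState:
    """State after snooping another CPU's write to the same line."""
    return LineState.INVALID  # the local copy is now out of date
```

For example, a line that starts invalid, is read while no other CPU holds it, is then written locally, and is finally written by another CPU passes through exclusive, modified and invalid in turn.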
A cache coherency failure occurs when an access by a CPU to a particular memory location returns a data value that is not the most recently updated data value. A major problem with cache coherency failures is that the CPU which detects a failure is frequently not the CPU which caused the failure. This problem makes it very difficult to identify the failing CPU board. During manufacture of a multiprocessor computer system, CPU boards causing cache coherency failures must be identified and replaced in order to assure a properly working system. Furthermore, a need exists for a method of flushing out cache coherency design errors in an off-line environment during engineering development. Tests that run on-line are not suitable for engineering development. In the past, isolation of cache coherency failures has often required manually adding and removing CPU boards until the system is able to run successfully. This method is tedious, time-consuming and error-prone.
What is needed, then, is a method of rigorously testing cache coherency in a multiprocessor system and of automatically identifying the CPU(s) which cause cache coherency failure. A desirable approach to testing cache coherency is to create maximum cache coherency traffic on the shared bus in order to cause failures to surface. For rigorous testing, every possible sequence of cache-coherency bus operations should occur.
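To give a sense of what "every possible sequence of cache-coherency bus operations" implies, the sequences can be enumerated mechanically. The sketch below is illustrative only: the operation names in BUS_OPS are hypothetical placeholders rather than the actual I-BUS operations, and the function all_sequences is an assumed helper, not part of any described test.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Hypothetical operation names, used only as placeholders.
BUS_OPS = ("READ", "READ_EXCLUSIVE", "INVALIDATE", "WRITE_BACK")

Step = Tuple[int, str]  # (cpu_id, bus operation)


def all_sequences(num_cpus: int, ops: Sequence[str], length: int) -> List[Tuple[Step, ...]]:
    """Enumerate every ordered sequence of `length` steps.

    Each step pairs one CPU with one bus operation, so there are
    (num_cpus * len(ops)) ** length sequences in total.
    """
    steps = list(product(range(num_cpus), ops))
    return list(product(steps, repeat=length))


sequences = all_sequences(num_cpus=2, ops=BUS_OPS, length=2)
```

The count grows exponentially with sequence length, which is one reason a test aiming at such coverage would need to generate bus traffic automatically rather than by hand.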