Most system designs are represented by a model written in some hardware description language ("HDL") that can be later transformed into silicon. The pre-silicon model is extensively verified through simulation before it is fabricated ("taped-out"). Since the fabrication process is very expensive, it is necessary to keep the number of tape-outs to a minimum by exposing all bugs either in simulation or in early releases of the hardware. While software simulators are slow, they permit unrestricted use of checker probes into the model-under-test. As a result, any violation exposed during simulation can be detected via the probes. On the other hand, hardware exercise programs can run at a very high speed but their checking abilities are limited to the data observed in the testcase.
Various testing methods and background information are found in A. Saha, N. Malik, J. Lin, C. Lockett and C. G. Ward, "Test floor Verification of Multiprocessor Hardware", IPCCC 1996; IBM, "PowerPC Architecture", Morgan Kaufmann Publishers, 1993; L. Lamport, "How to make a multiprocessor computer that correctly executes multiprocess programs", IEEE Transactions on Computers, September 1979; W. W. Collier, "Reasoning about Parallel Architectures", Prentice-Hall Inc, 1990; Kourosh Gharachorloo, et al., "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors", Proc. 17th Annual Symposium on Computer Architecture, May 1990; Janice Stone and Robert Fitzgerald, "An Overview of Storage in PowerPC", Technical Report, IBM T. J. Watson Research Center, February 1993; A. Saha, N. Malik, B. O'Krafka, J. Lin, R. Raghavan and U. Shamsi, "A Simulation Based Approach to Architectural Verification of Multiprocessor Systems", IPCCC 1995; J. T. Yen, et al., "Overview of PowerPC 620 Multiprocessor Verification Strategy", Proc. International Test Conference, 1995; D. T. Marr, et al., "Multiprocessor Validation of the Pentium Pro", IEEE Computer Magazine, November 1996; G. Cai, "Architectural and Multiprocessor Design Verification of the PowerPC 604 Data Cache", IPCCC 1995; and Ram Raghavan et al., "Multiprocessor System Verification through Behavioral Modeling and Simulation", IPCCC 1995. However, present methods suffer from a variety of drawbacks which will be described in greater detail herein.
Below is a set of definitions used throughout this application:
True Sharing: When two (or more) processors compete to access the same location within a cache block, the accesses are said to be true sharing. The outcome of these accesses is non-deterministic until runtime.
False-Sharing: When two (or more) processors access different locations within a cache block, the accesses are said to be false sharing. The outcome of these accesses is deterministic and can be computed a priori by sequentially running each processor's test stream on a uni-processor.
Non-Sharing: When two (or more) processors access different locations in different cache blocks, the accesses are called non-sharing accesses. The outcome of these accesses is deterministic and can be computed much the same way as the false sharing case.
Barrier: A barrier is a section of code (written using synchronization primitives) placed within each participating processor's stream. Its purpose is to ensure no participating processor continues past it until all participating processors have reached it in their respective streams. Also, when a processor reaches a barrier, all storage accesses initiated prior to the barrier must be performed with respect to all the other processors.
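As an illustration of the three sharing categories defined above, the following minimal sketch classifies a pair of accesses made by two different processors. The 64-byte block size, the function name and the treatment of a "location" as a single byte address (access width is ignored) are assumptions made only for this example.

```python
from enum import Enum

CACHE_BLOCK_BYTES = 64  # assumed block size, for illustration only


class Sharing(Enum):
    TRUE = "true sharing"
    FALSE = "false sharing"
    NONE = "non-sharing"


def classify_sharing(addr_a: int, addr_b: int,
                     block_bytes: int = CACHE_BLOCK_BYTES) -> Sharing:
    """Classify two byte addresses accessed by different processors."""
    if addr_a == addr_b:
        # Same location: outcome is non-deterministic until runtime.
        return Sharing.TRUE
    if addr_a // block_bytes == addr_b // block_bytes:
        # Different locations inside the same cache block.
        return Sharing.FALSE
    # Different locations in different cache blocks.
    return Sharing.NONE
```

Under these assumptions, addresses 0x100 and 0x108 fall in the same block and are false sharing, while 0x100 and 0x140 fall in different blocks and are non-sharing.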
Every architecture defines ordering rules for storage accesses to memory locations. The most restrictive form of ordering, Sequential Consistency, limits the performance of programs by requiring all storage accesses to be strictly ordered. Several new techniques, like weak-ordering, have relaxed this requirement such that, under certain conditions, storage accesses may execute out-of-order. Any required ordering is enforced by synchronization primitives which are an integral part of these architectures.
Thus, it is important to understand some weak-ordering rules which are provided below:
Rule 1: dependent storage accesses from a processor must perform in order, and all non-dependent accesses may perform out-of-order. However, in the presence of synchronization primitives, all storage accesses initiated prior to the synchronization primitive must perform before it performs, and all storage accesses initiated after a synchronization primitive must perform after it performs. By dependent, it is meant that these accesses are to the same location (address dependency) or there is some explicit register-dependency among them.
Rule 2: accesses to the same location are said to compete with each other when at least one of them is a store operation. Competing storage accesses from different processors can perform in any order. As a result, these accesses must be made non-competing by enclosing them within critical sections which are governed by lock and unlock routines.
Rule 3: all accesses to a particular location are coherent if all stores to the same location are serialized in some order and no processor can observe any subset of those stores in a conflicting order. That is, a processor can never load a "new" value first and later load an "older" value.
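The coherence condition of Rule 3 can be sketched as a check over observed load values for a single location. This is only an illustrative sketch: it assumes that every store writes a distinct value (so that a loaded value identifies the store that produced it), that the serialization order of the stores is known, and that the initial value is listed first in that order. The function and parameter names are hypothetical.

```python
def is_coherent(store_order, observed_by_proc):
    """Check Rule 3 for one location.

    store_order: distinct values in the order the stores were serialized,
                 with the initial value first (assumption of this sketch).
    observed_by_proc: maps a processor id to the values it loaded from the
                      location, in program order.
    """
    rank = {value: i for i, value in enumerate(store_order)}
    for loads in observed_by_proc.values():
        newest_seen = -1
        for value in loads:
            if value not in rank:
                return False  # value was never stored to this location
            if rank[value] < newest_seen:
                return False  # "new" value loaded first, "older" value later
            newest_seen = rank[value]
    return True
```

A processor that loads the values 0, 1, 1, 2 against the serialization 0, 1, 2 passes the check; a processor that loads 2 and then 1 fails it.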
A commonly used hardware exerciser is shown in FIG. 1. The hardware exerciser 100 is a program consisting of a Random Test Generator ("RTG") 102 and a simple Functional Simulator 108. It can be executed on either the hardware-under-test ("HUT") 104 or on another machine known to function correctly.
The RTG 102 produces random streams of storage access instructions for each processor in the system. Due to buffering delays, the global order of storage accesses from different processors to a particular location is non-deterministic.
Consequently, when two processors compete (as defined in Rule 2), the load operation may read either the value held before the store operation or the value written by the store operation. This asynchronous nature of storage access ordering restricts the testcases generated by the RTG 102 to false-sharing or non-sharing accesses, so that the expected results are deterministic.
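The non-determinism of a competing store and load can be illustrated by enumerating both global orders of the two accesses. The sketch below is a deliberately simplified model with a single memory cell and no buffering; its function name and arguments are assumptions made for the example.

```python
from itertools import permutations


def possible_load_values(initial, stored):
    """Return every value a load can observe when it competes with one store."""
    outcomes = set()
    for order in permutations(("store", "load")):
        memory = initial
        for op in order:
            if op == "store":
                memory = stored   # the store from one processor performs
            else:
                loaded = memory   # the load from the other processor performs
        outcomes.add(loaded)
    return outcomes
```

Because both the old and the new value are legal results, no single expected value can be computed for such a load before the test runs.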
The Functional Simulator 108 is a simple reference model of the architecture and not of the actual system. Given a multiprocessor ("MP") testcase, it computes a set of deterministic expected values. Since the storage accesses are false-sharing or non-sharing, the expected results can be computed by sequentially executing each processor's stream on the Functional Simulator 108. After the expected results are computed, the MP testcase is loaded on the HUT 104 and executed. When the testcase completes, the comparator 110 in the checker 106 compares the expected values from the Functional Simulator 108 with the values obtained from the actual MP test run. If a mismatch occurs, a violation has been detected.
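The flow described above (compute expected values sequentially, run the test, compare at the end) can be sketched as follows. This is not the Functional Simulator 108 itself, only a hypothetical stand-in: it models store operations alone, represents each stream as a list of (op, address, value) tuples, and is valid only when no two processors access the same location.

```python
def expected_memory(streams, initial=None):
    """Run each processor's stream one after another, as on a uni-processor.

    streams: maps a processor id to a list of (op, addr, value) tuples,
             where op is "store" or "load" (loads carry value None).
    """
    memory = dict(initial or {})
    for proc in sorted(streams):
        for op, addr, value in streams[proc]:
            if op == "store":
                memory[addr] = value
    return memory


def miscompares(expected, actual):
    """Return {addr: (expected, actual)} for every location that differs."""
    return {
        addr: (exp, actual.get(addr))
        for addr, exp in expected.items()
        if actual.get(addr) != exp
    }
```

An empty result from `miscompares` corresponds to a passing testcase; any entry corresponds to a detected violation. Note that only final values are compared, which matches the end-of-test checking described in this section.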
A variation of the above exerciser supports some restricted true sharing with the use of barriers. If two processors execute competing accesses (true sharing), the RTG 102 identifies these accesses and partitions them across barriers. In the example shown in FIG. 2A, two processors, P0 and P1, compete for memory locations A and B. In FIG. 2B, the code sequence is modified so that the accesses to A and B are serialized using barriers.
If the stream between two barriers is defined as a barrier-window, only one processor is allowed to access a particular location within a barrier-window. As a result, the accesses to a location across all processors are the same as sequentially executing (in the order of the barriers) these accesses from a uni-processor system. The expected result for each location is the final value held by the location at the end of the test. A miscompare between the actual and the expected values indicates a violation in either a barrier-window or the barrier function itself.
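The barrier-window constraint can be sketched as a greedy partitioning of a list of accesses: a new window (that is, a barrier) is started whenever an access would cause a second processor to touch a location already accessed in the current window. This is an assumed strategy chosen for illustration, not necessarily how the RTG 102 partitions accesses, and the function name and input format are hypothetical.

```python
def partition_into_windows(accesses):
    """Split (proc, addr) accesses into barrier-windows.

    Within each returned window, every location is accessed by at most
    one processor. A barrier is implied between consecutive windows.
    """
    windows = [[]]
    owner = {}  # location -> the only processor allowed to use it in this window
    for proc, addr in accesses:
        if owner.get(addr, proc) != proc:
            # Another processor already owns this location: insert a barrier.
            windows.append([])
            owner = {}
        owner[addr] = proc
        windows[-1].append((proc, addr))
    return windows
```

For the access list P0:A, P1:B, P1:A, P0:B, the sketch yields two windows, with P0:A and P1:B in the first and P1:A and P0:B in the second, so that each location has a single accessing processor per window.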
The importance of true sharing can be understood from Rule 3. In the hardware exerciser described above, the use of barriers to accommodate true sharing restricts false-sharing, non-sharing and true-sharing storage accesses to executing only within a barrier-window. Consequently, bugs that may be exposed under unrestricted conditions will escape detection. In addition, lock and barrier functions themselves perform unrestricted true sharing accesses; as a result, the underlying instructions which implement these functions must be verified under unrestricted conditions.
In addition to the inability to verify true sharing accesses, the hardware exerciser shown in FIG. 1 is only able to perform checks on the values held in the various locations at the end of the testcase. Consequently, it is possible for the HUT 104 to correctly complete the testcase (correct expected values) but still contain violations that escaped detection. In order to detect these escapes, several billion cycles and sequences may have to be run such that these violations have a chance to propagate to the end of the testcase.
In weakly-ordered architectures, synchronization primitives are extensively used to implement control over accesses to shared data. The barrier function, which controls accesses to true sharing locations, is implemented using synchronization primitives. While the correct operation of the barrier function is itself a check, it is not the only mechanism to verify synchronization primitives.