The present invention generally relates to multiprocessing computer systems, and more specifically to determining memory access delays in a CC-NUMA environment for use in exhaustively testing interactions among multiple tightly coupled processors.
The literature is full of examples where processor and system faults, or “bugs”, were discovered long after the processors or systems were shipped to customers. It is well known that the later in the product cycle a “bug” is discovered, the greater the expense to fix it. Compounding this problem is the trend towards shorter and shorter product cycles. The problem is compounded again by the trend towards tightly coupled multiple processor computer systems, because in such a system it is necessary to discover and fix not only the faults in a single processor, but also the faults resulting from the interaction among the multiple processors.
One problem with implementing tightly coupled multiple processor computer systems is exhaustively testing the interactions between and among multiple processors. For example, in a tightly coupled system, two or more processors may each have an individual high-speed level one (L1) cache, and share a slightly lower speed level two (L2) cache. This L2 cache is traditionally backed by an even larger main memory. The L1 and L2 caches are typically comprised of high-speed Static Random Access Memory (SRAM), and the main memory is typically comprised of slower Dynamic Random Access Memory (DRAM).
It is necessary that the caches and memory be kept coherent. Thus, for example, at most a single L1 cache of a single processor is allowed to contain a cache line corresponding to a given block of main memory. When multiple processors are reading and writing the same block in memory, a conflict arises among their cache controllers. This conflict is typically resolved in a tightly coupled multiprocessor system with an interprocessor cache protocol communicated over an interprocessor bus. For example, a first processor may be required to reserve a cache copy of the contested block of memory. This reservation is communicated to the other processors. However, if another (second) processor has already reserved the contested block of memory, the first processor must wait until the block is unlocked, and potentially written back at least to the L2 cache.
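The reservation behavior described above can be illustrated with a minimal sketch. It is a toy model only, not an actual cache protocol: the class name `OwnershipDirectory`, its methods, and the single shared ownership table are assumptions introduced for illustration, and write-back to the L2 cache is not modeled.

```python
from typing import Dict, Optional


class OwnershipDirectory:
    """Toy model (hypothetical): at most one processor may reserve a block."""

    def __init__(self) -> None:
        # Maps a memory block address to the processor that has reserved it.
        self._owner: Dict[int, int] = {}

    def reserve(self, cpu: int, block: int) -> bool:
        """Try to reserve `block` for `cpu`.

        Returns True if the reservation is granted, or False if another
        processor already holds the block and `cpu` must wait.
        """
        owner = self._owner.get(block)
        if owner is None or owner == cpu:
            self._owner[block] = cpu
            return True
        return False

    def release(self, cpu: int, block: int) -> None:
        """Unlock `block` if `cpu` is its current owner."""
        if self._owner.get(block) == cpu:
            del self._owner[block]

    def owner(self, block: int) -> Optional[int]:
        """Return the processor holding `block`, or None if it is free."""
        return self._owner.get(block)
```

In this sketch a second processor's call to `reserve` fails until the first processor calls `release`, which corresponds to the wait described above.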
Debugging a cache protocol can be quite difficult. This stems from a number of interrelated factors. First, the multiple processors typically operate asynchronously from each other at extremely high frequencies. Second, the L1 caches and their cache controllers typically operate at essentially the same speed as the processors. Third, instruction cache misses for test instruction sequences can delay instruction execution by relatively long, somewhat variable, periods of time. There are a number of reasons for this latter problem. One reason is that it may be possible to retrieve a cache line of instructions from the L1 cache or from the L2 cache, or it may be necessary to load the cache line from slower main memory. The DRAM comprising the main memory typically operates considerably slower than the processor (and the L1 cache). Another reason is that the time it takes to fetch a block of instructions from the main memory may vary slightly. There are a number of causes of this variation. First, accessing different addresses in the DRAM may take slightly different times, partly because of differing signal path lengths. Second, different memory banks may have slightly different timing. This is true even when the specifications for the memories are equivalent, and particularly true when the memories are self-timed. The variation may be accentuated when multiple processors or multiple memories share a common memory access bus, where the actions of one processor or memory may lock out, and stall, another processor or memory. Note also that asynchronous Input/Output (I/O) operations to memory can have seemingly random effects on timing.
Despite the problems described above, in order to effectively test the interaction among multiple processors, it is preferable to exhaustively test each set of possible combinations. In the case of a cache protocol as described above, it is preferable to exhaustively test each possible set of cache states and cache state transitions. It is also preferable to be able to detect and record state changes at a lower level than that available to a user program.
In order to test the interactions among multiple processors, the various combinations of states and state transitions should be tested. This can be done by executing programs simultaneously on each of the processors. Varying the time when each processor executes its program can test the different combinations. Unfortunately, there is no mechanism in the prior art to accurately and exhaustively vary the times when each processor executes its program. This is partly due to the processor instruction timing variations described above. The result is that timing windows often arise in which particular state and state transition interactions are not tested.
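The idea of varying the time at which each processor begins its program can be sketched as an enumeration of start offsets. This is an illustrative sketch only: the function names are hypothetical, delays are assumed to be whole clock cycles, and it is assumed that each combination could be realized exactly, which is precisely what the timing variations described above prevent in practice.

```python
from itertools import product
from typing import Iterator, List, Tuple


def start_offsets(num_cpus: int, max_delay: int) -> Iterator[Tuple[int, ...]]:
    """Yield every combination of per-processor start delays in 0..max_delay."""
    return product(range(max_delay + 1), repeat=num_cpus)


def distinct_relative_offsets(num_cpus: int, max_delay: int) -> List[Tuple[int, ...]]:
    """Keep only combinations in which at least one processor starts at 0.

    Delaying every processor by the same amount does not change their
    relative timing, so such combinations are redundant for this purpose.
    """
    return [c for c in start_offsets(num_cpus, max_delay) if min(c) == 0]
```

The number of combinations grows as (max_delay + 1) raised to the number of processors, which is why an untested window is easy to leave behind when the offsets cannot be controlled precisely.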
One solution to this problem is to increase the number of tests run and the number of test cycles run. This increases the chances of uncovering faults, but does not guarantee exhaustive fault coverage.
Another set of prior art solutions is to try to control more closely the timing between executions of programs by the multiple processors. One such solution is to use NOP instructions to delay execution. The larger the number of NOP instructions executed, the longer the delay. However, NOP instructions are typically executed out of blocks of instructions held in cache lines. Each time execution crosses a cache line boundary, there is a potential for a cache miss, resulting in retrieving the cache line from slower memory. There is also a potential at that point that execution may be delayed for one or more cycles due to memory bus contention. Each of these potential delays introduces a potential window that did not get tested utilizing this set of solutions. Note also that virtual memory program activity must also be accounted for.
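The cache line boundary problem with NOP-based delays can be made concrete with a small calculation. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the 4-byte NOP and 32-byte cache line are example values, not figures taken from any particular processor.

```python
def line_crossings(start_addr: int, nop_count: int,
                   nop_bytes: int = 4, line_bytes: int = 32) -> int:
    """Count the cache line boundaries crossed by a run of sequential NOPs.

    Each crossing is a point where a cache miss or bus contention could
    add an uncontrolled delay. Sizes are hypothetical example values.
    """
    if nop_count <= 0:
        return 0
    first_line = start_addr // line_bytes
    # Address of the last NOP in the run.
    last_addr = start_addr + (nop_count - 1) * nop_bytes
    return last_addr // line_bytes - first_line
```

Under these example sizes, eight NOPs starting at a line-aligned address stay within one cache line, while a ninth NOP crosses into the next line and introduces one more point of potential timing uncertainty.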
Another problem that arises is that it is often hard to distinguish states and state transitions from a programmer's view of a processor. This is partly because there is much that is not visible at this level. States and state transitions must therefore be inferred from changes in the processor that are visible at the programmer model level. This problem of distinguishing states and state transitions is a particular problem when the states and state transitions are cache states and state transitions during interaction testing among multiple processors.
One prior art solution to determining machine states and state transitions is through the use of SCAN. Using SCAN, a known pattern of states can be loaded into a processor. The processor then executes one or two instructions. The states of the various memory elements in the processor are then unloaded from the processor and compared with their expected values. This type of functional testing is becoming common for high-end microprocessors. Unfortunately, it does not lend itself to exhaustively testing the interactions among multiple processors. One reason for this is that a processor under the control of SCAN typically executes for only one or two instruction cycles before the SCAN latches are unloaded and another set of values is loaded. The result is that SCAN is extremely slow, especially in comparison to the speed of modern processors. This significantly reduces the amount of testing that can realistically be done with SCAN. A second reason is that there is no readily apparent mechanism available to test multiple processors at the same time, and more importantly to vary the start times of each of the multiple processors being tested together.
In the past, it has sometimes been possible to run enough signals out of a processor that the states and state transitions being tested can be monitored by test equipment. One problem with this method of testing is that it is a manual and error-prone process. Just as important, this method is fast becoming less and less feasible as more and more functionality is embedded on single chips. Pin count has become a major concern, and it has become increasingly unlikely that precious external pins can be dedicated to the sort of interprocessor state testing described above.
Testability, and thus reliability through earlier fault detection, would be significantly increased in tightly coupled multiprocessor systems if the interactions among multiple processors could be accurately and exhaustively tested, with the guarantee that no timing windows were inadvertently left untested. This testability would be further enhanced by a mechanism for recording states and state transitions over a series of clock cycles for each of the processors being tested.
One problem that arises when exhaustively testing the interactions among multiple processors occurs when signals take differing lengths of time to travel between various pairs of processors. This is the case in a Cache Coherent Non-Uniform Memory Architecture (CC-NUMA), for example one in which there are multiple processor modules, with each processor module containing multiple processors sharing a cache memory. These differing lengths of time can bias and interfere with the exhaustive testing of the interactions among multiple processors.
One solution would be to “hard code” delay values depending on whether or not processors were in the same processor module, and thus shared a cache memory. Unfortunately, the actual delays tend to vary slightly between different computer systems, over time, and as technology changes. It would thus be advantageous to be able to utilize accurate intra-processor delay times when exhaustively testing the interactions among processors.
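The hard-coded approach criticized above might look like the following sketch. Every specific here is hypothetical: the function name, the assumption of four processors per module numbered consecutively, and the cycle counts of 10 and 40 are illustrative placeholders, not measured values.

```python
def hard_coded_delay(cpu_a: int, cpu_b: int, cpus_per_module: int = 4,
                     same_module_cycles: int = 10,
                     cross_module_cycles: int = 40) -> int:
    """Return a fixed delay chosen only by whether two processors share a module.

    All default values are hypothetical placeholders. Because they are
    fixed, they cannot track the variation between systems, over time,
    or across technology changes that the text describes.
    """
    same_module = (cpu_a // cpus_per_module) == (cpu_b // cpus_per_module)
    return same_module_cycles if same_module else cross_module_cycles
```

The weakness is visible in the signature itself: the delay depends only on module membership, so any drift in the actual hardware timing leaves the constants silently wrong.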