The present invention is directed, in general, to multiprocessor systems and, more specifically, to a system and method for disrupting a lock-step sequence condition in a server containing multiple processor units.
Increasingly, state-of-the-art computer applications implement high-end tasks that require multiple processors for efficient execution. Multiprocessor systems allow parallel execution of multiple tasks on two or more central processor units (“CPUs”). A typical multiprocessor system may be, for example, a network server. Preferably, a multiprocessor system is built using widely available commodity components, such as the Intel Pentium® Pro processor (also called the “P6” processor), PCI I/O chipsets, Pentium® Pro processor bus topology, and standard memory modules, such as SIMMs and DIMMs. There are numerous well-known multiprocessor system architectures, including symmetrical multiprocessing (“SMP”), non-uniform memory access (“NUMA”), cache-coherent NUMA (“CC-NUMA”), clustered computing, and massively parallel processing (“MPP”).
A symmetrical multiprocessing (“SMP”) system contains two or more identical processors that independently process as “peers” (i.e., no master/slave processing). Each of the processors (or CPUs) in an SMP system has equal access to the resources of the system, including memory access. A NUMA system contains two or more equal processors that have unequal access to memory. NUMA encompasses several different architectures that can be grouped together because of their non-uniform memory access latency, including replicated memory cluster (“RMC”), MPP, and CC-NUMA. In a NUMA system, memory is usually divided into local memories, which are placed close to processors, and remote memories, which are not close to a processor or processor cluster. Shared memories may be allocated into one of the local memories or distributed between two or more local memories. In a CC-NUMA system, multiple processors in a single node share a single memory and cache coherency is maintained using hardware techniques. Unlike an SMP node, however, a CC-NUMA system uses a directory-based coherency scheme, rather than a snoopy bus, to maintain coherency across all of the processors. RMC and MPP have multiple nodes or clusters and maintain coherency through software techniques. RMC and MPP may be described as NUMA architectures because of the unequal memory latencies associated with software coherency between nodes.
All of the above-described multiprocessor architectures require some type of cache coherence apparatus, whether implemented in hardware or in software. High-speed CPUs, such as the Pentium® Pro processor, utilize an internal cache and, typically, an external cache to maximize the CPU efficiency. Because an SMP system usually operates only one copy of the operating system, the interoperation of the CPUs and memory must maintain data coherency. In this context, coherency means that, at any one time, there is but a single valid value for each datum. It is therefore necessary to maintain coherency between the CPU caches and main memory.
One popular coherency technique uses a “snoopy bus.” Each processor maintains its own local cache and “snoops” on the bus to look for read and write operations between other processors and main memory that may affect the contents of its own cache. If a first processor attempts to access a datum in main memory that a second processor has modified and is holding in its cache, the second processor will interrupt the memory access of the first processor and write the contents of its cache into memory. Then, all other snooping processors on the bus, including the first processor, will see the write operation occur on the bus and update their cache state information to maintain coherency.
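The snooping behavior described above can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the class name `SnoopyBus`, the two line states ("M" for modified, "S" for shared), and the whole-datum granularity are hypothetical choices made for illustration and are not taken from any particular processor's protocol.

```python
class SnoopyBus:
    """Toy model of caches snooping a shared bus (illustrative only)."""

    def __init__(self, num_cpus):
        self.memory = {}                                  # address -> value
        self.caches = [dict() for _ in range(num_cpus)]   # address -> [value, state]

    def write(self, cpu, addr, value):
        # Every other cache snoops the write and invalidates its own copy.
        for other, cache in enumerate(self.caches):
            if other != cpu:
                cache.pop(addr, None)
        # The writer now holds the only valid copy, marked modified.
        self.caches[cpu][addr] = [value, "M"]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr in cache:
            return cache[addr][0]                         # local cache hit
        # A cache holding a modified copy intervenes and writes it back first.
        for other, other_cache in enumerate(self.caches):
            entry = other_cache.get(addr)
            if other != cpu and entry and entry[1] == "M":
                self.memory[addr] = entry[0]
                entry[1] = "S"
        value = self.memory.get(addr, 0)
        cache[addr] = [value, "S"]
        return value


bus = SnoopyBus(num_cpus=2)
bus.write(1, 0x10, 99)          # CPU 1 modifies the datum in its cache
value = bus.read(0, 0x10)       # CPU 0 reads; CPU 1 writes back first
print(value, bus.memory[0x10], bus.caches[1][0x10][1])
```

In this model the read by CPU 0 returns the value that CPU 1 had modified, because the write-back happens before memory is consulted.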
Another popular coherency technique is “directory-based cache coherency.” Directory-based caching keeps a record of the state and location of every block of data in main memory. For every shareable memory address line, there is a “presence” bit for each coherent processor cache in the system. Whenever a processor requests a line of data from memory for its cache, the presence bit for that cache in that memory line is set. Whenever one of the processors attempts to write to that memory line, the presence bits are used to invalidate the cache lines of all the caches that previously used that memory line. All of the presence bits for the memory line are then reset and the specific presence bit is set for the processor that is writing to the memory line. Therefore, all of the processors do not have to reside on a common snoop bus because the directory maintains coherency for the individual processors.
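The presence-bit bookkeeping described above can be sketched in a few lines. The class and method names (`Directory`, `read`, `write`) are hypothetical and chosen only for illustration; a real directory would also track line state and would issue the invalidations over an interconnect rather than return them as a list.

```python
class Directory:
    """Toy directory: one presence bit per cache for each memory line."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.presence = {}      # line address -> list of presence bits

    def _bits(self, line):
        return self.presence.setdefault(line, [False] * self.num_caches)

    def read(self, cache_id, line):
        # A cache that fetches the line gets its presence bit set.
        self._bits(line)[cache_id] = True

    def write(self, cache_id, line):
        bits = self._bits(line)
        # Caches that hold the line, other than the writer, must be invalidated.
        invalidated = [c for c, present in enumerate(bits)
                       if present and c != cache_id]
        # Reset every presence bit, then set only the writer's bit.
        self.presence[line] = [c == cache_id for c in range(self.num_caches)]
        return invalidated


directory = Directory(num_caches=4)
directory.read(0, 0x40)
directory.read(2, 0x40)
invalidated = directory.write(1, 0x40)
print(invalidated)                 # [0, 2]
print(directory.presence[0x40])    # [False, True, False, False]
```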
From the foregoing description, it can be seen that from time to time, two or more processors will attempt to access data from the same location at the same time. In the normal operation of a multiprocessor system, this may result in one or more processors being “retried.” That is, a processor performs a memory access to a certain memory location and the memory access is denied because the memory location is temporarily unavailable. When this occurs, the processor retries the memory access within a very short period of time and usually succeeds in accessing the memory location during the retry.
It is known, however, that two or more processors may occasionally get trapped in an endlessly repeating cycle of retries that fails to ever access the desired memory location. This condition may be referred to as a “lock-step sequence.” The circumstances leading to a lock-step sequence are complex, and proving that a multiprocessor design is not susceptible to a lock-step condition is difficult due to the design complexity and the number of possible states in the system. In its essentials, a lock-step sequence may be recognized as a group of CPUs trying to access a line of data in memory that has been locked out by another CPU that has control over that line. Each of the locked-out CPUs retries the line and fails, thereby causing another retry to be scheduled. The sequencing of the retries by the CPUs is such that the CPU that has actual control over the line is prevented from unlocking the line because the memory controller is always busy servicing the retry requests of the locked-out CPUs.
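The starvation at the heart of a lock-step sequence can be shown with a deliberately simple timing model. The model is an assumption made for illustration, not a description of any real memory controller: the controller handles one request per cycle, each locked-out CPU retries at a fixed period from a fixed starting offset, and a retry always wins arbitration over the owner's unlock request. The function name `first_idle_cycle` is hypothetical.

```python
def first_idle_cycle(retry_period, offsets, cycles):
    """Return the first cycle in which the owner can unlock the line, or None.

    Assumed model: one request per cycle, and a retry from any locked-out
    CPU takes the slot ahead of the owner's unlock request.
    """
    for cycle in range(cycles):
        # The controller is busy if any locked-out CPU retries in this cycle.
        busy = any((cycle - offset) % retry_period == 0 for offset in offsets)
        if not busy:
            return cycle        # idle slot: the owner's unlock gets through
    return None                 # the owner was starved for the whole run


# Two CPUs retrying every 2 cycles, one cycle apart, fill every slot.
print(first_idle_cycle(retry_period=2, offsets=[0, 1], cycles=1000))  # None
# A slightly different retry timing leaves a gap almost immediately.
print(first_idle_cycle(retry_period=3, offsets=[0, 1], cycles=1000))  # 2
```

The contrast between the two calls reflects the point made above: whether the owner is starved depends entirely on how the retry timings happen to interleave, which is why the condition is hard to predict or reproduce.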
In this situation, a great deal of bus traffic appears to be occurring, but no actual work is being accomplished by many, if not all, of the CPUs. The applications being run by the multiprocessor system are instead “frozen” in place. As noted before, this condition is difficult to reproduce and correct due to the complexity of the timing of memory requests that cause the condition. The result is that many types of multiprocessor systems will from time to time lock up and require operator intervention to clear the condition. This causes much frustration and reduces the overall processing efficiency of the system.
Therefore, there is a need in the art for improved multiprocessor systems that can more effectively avoid intermittent frozen states that result from lock-step sequences among two or more processors. In particular, there is a need in the art for systems, circuits, and methods that are able to clear a lock-step condition within a relatively short time period and without the need for operator intervention.
The lock-step sequence problems inherent in the prior art are overcome by the present invention. In one embodiment of the present invention, a control circuit is provided for use in a processing system containing a plurality of processors coupled to a main memory by a first common bus, wherein the control circuit perturbs a lock-step sequence of memory requests received from the processors. The control circuit comprises a memory request generator, adapted to be coupled to the first common bus, for generating at least one memory request operable to terminate the lock-step sequence of memory requests.
In one embodiment of the present invention, the at least one memory request is generated pseudo-randomly. In another embodiment of the present invention, a duration of the at least one memory request is generated pseudo-randomly. In yet another embodiment, such a memory request, or a duration of such memory request, is generated other than pseudo-randomly.
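As an illustration of how the timing and duration of such requests might be chosen pseudo-randomly, the sketch below uses a 16-bit linear-feedback shift register, a generator that is inexpensive to build in hardware. This is an assumption: the text above does not state which pseudo-random generator is used. The function names, the seed, and the `max_gap` and `max_duration` limits are hypothetical values picked for the example, and the tap positions follow a widely published 16-bit example.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one step (taps at bits 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def perturbation_schedule(seed, count, max_gap=64, max_duration=8):
    """Return `count` (gap, duration) pairs, both measured in bus cycles.

    gap:      idle cycles to wait before issuing the next perturbing request
    duration: number of cycles that request occupies the bus
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero; an all-zero LFSR never changes")
    schedule = []
    for _ in range(count):
        state = lfsr16_step(state)
        gap = 1 + state % max_gap
        state = lfsr16_step(state)
        duration = 1 + state % max_duration
        schedule.append((gap, duration))
    return schedule


schedule = perturbation_schedule(seed=0xACE1, count=5)
for gap, duration in schedule:
    print(f"wait {gap} cycles, then hold the bus for {duration} cycles")
```

Because the sequence is fully determined by the seed, the same schedule can be replayed when debugging, while still being irregular enough that it is unlikely to fall into step with a fixed retry pattern. Successive values from a shift register are correlated, so a design that needs better statistical quality would draw its values differently.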
In other embodiments of the present invention, wherein the processing system further comprises a plurality of I/O devices coupled to the main memory by a second common bus, the memory request generator is adapted to be coupled to the second common bus and further generates at least one memory request on the second common bus operable to terminate a second lock-step sequence of memory requests received from the I/O devices.
In still other embodiments of the present invention, the at least one memory request on the second common bus is generated pseudo-randomly. In further embodiments of the present invention, a duration of the at least one memory request on the second common bus is generated pseudo-randomly. Again, such a memory request, or a duration of such memory request, may also be generated other than pseudo-randomly.
In alternate embodiments of the present invention, the at least one memory request on the first common bus and the at least one memory request on the second common bus are generated simultaneously. In still other embodiments of the present invention, the at least one memory request on the first common bus and the at least one memory request on the second common bus are generated at different times.
The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. For instance, the foregoing functionality may certainly be implemented in software, hardware, firmware, or some suitable combination of at least two of the same. Those skilled in the art should appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.