Increasingly, state-of-the-art computer applications implement high-end tasks that require multiple processors for efficient execution. Multiprocessor systems allow parallel execution of multiple tasks on two or more central processor units ("CPUs"). A typical multiprocessor system may be, for example, a network server. Preferably, a multiprocessor system is built using widely available commodity components, such as the Intel Pentium Pro™ processor (also called the "P6" processor), PCI I/O chipsets, P6 bus topology, and standard memory modules, such as SIMMs and DIMMs. There are numerous well-known multiprocessor system architectures, including symmetrical multiprocessing ("SMP"), non-uniform memory access ("NUMA"), cache-coherent NUMA ("CC-NUMA"), clustered computing, and massively parallel processing ("MPP").
A symmetrical multiprocessing ("SMP") system contains two or more identical processors that independently process as "peers" (i.e., no master/slave processing). Each of the processors (or CPUs) in an SMP system has equal access to the resources of the system, including memory access. A NUMA system contains two or more equal processors that have unequal access to memory. NUMA encompasses several different architectures that can be grouped together because of their non-uniform memory access latency, including replicated memory cluster ("RMC"), MPP, and CC-NUMA. In a NUMA system, memory is usually divided into local memories, which are placed close to processors, and remote memories, which are not close to a processor or processor cluster. Shared memories may be allocated into one of the local memories or distributed between two or more local memories. In a CC-NUMA system, multiple processors in a single node share a single memory and cache coherency is maintained using hardware techniques. Unlike an SMP node, however, a CC-NUMA system uses a directory-based coherency scheme, rather than a snoopy bus, to maintain coherency across all of the processors. RMC and MPP have multiple nodes or clusters and maintain coherency through software techniques. RMC and MPP may be described as NUMA architectures because of the unequal memory latencies associated with software coherency between nodes.
All of the above-described multiprocessor architectures require some type of cache coherence apparatus, whether implemented in hardware or in software. High speed CPUs, such as the P6 processor, utilize an internal cache and, typically, an external cache to maximize the CPU speed. Because an SMP system usually operates only one copy of the operating system, the interoperation of the CPUs and memory must maintain data coherency. In this context, coherency means that, at any one time, there is but a single valid value for each datum. It is therefore necessary to maintain coherency between the CPU caches and main memory.
One popular coherency technique uses a "snoopy bus." Each processor maintains its own local cache and "snoops" on the bus to look for read and write operations between other processors and main memory that may affect the contents of its own cache. If a first processor attempts to access a datum in main memory that a second processor has modified and is holding in its cache, the second processor will interrupt the memory access of the first processor and write the contents of its cache into memory. Then, all other snooping processors on the bus, including the first processor, will see the write operation occur on the bus and update their cache state information to maintain coherency.
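The snooping sequence described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of the P6 bus: the class names (`Bus`, `SnoopingCache`), the three line states ("M" modified, "S" shared, "I" invalid), and the single-value cache lines are all assumptions made for the example.

```python
class Bus:
    """Hypothetical shared bus: every transaction is visible to all caches."""

    def __init__(self, memory):
        self.memory = memory          # main memory: address -> value
        self.caches = []

    def read(self, requester, addr):
        # Every other cache snoops the read. A cache holding a modified
        # copy intervenes and writes it back before memory supplies data.
        for cache in self.caches:
            if cache is not requester:
                cache.snoop_read(addr)
        return self.memory[addr]

    def invalidate(self, requester, addr):
        # Every other cache snoops the write and drops its copy.
        for cache in self.caches:
            if cache is not requester:
                cache.snoop_invalidate(addr)


class SnoopingCache:
    """Hypothetical per-processor cache with assumed M/S/I line states."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}               # address -> [state, value]
        bus.caches.append(self)

    def read(self, addr):
        line = self.lines.get(addr)
        if line is None or line[0] == "I":
            self.lines[addr] = ["S", self.bus.read(self, addr)]
        return self.lines[addr][1]

    def write(self, addr, value):
        self.bus.invalidate(self, addr)
        self.lines[addr] = ["M", value]

    def snoop_read(self, addr):
        line = self.lines.get(addr)
        if line and line[0] == "M":
            self.bus.memory[addr] = line[1]   # write back the modified datum
            line[0] = "S"

    def snoop_invalidate(self, addr):
        line = self.lines.get(addr)
        if line:
            if line[0] == "M":
                self.bus.memory[addr] = line[1]
            line[0] = "I"


memory = {0x10: 1}
bus = Bus(memory)
first, second = SnoopingCache(bus), SnoopingCache(bus)
second.write(0x10, 42)        # second processor holds a modified copy
value = first.read(0x10)      # first processor's read triggers the write-back
```

In this sketch, main memory still holds the stale value after `second.write`; it is only when `first.read` appears on the bus that the second cache writes its copy back and both caches end up sharing the current value.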
Another popular coherency technique is "directory-based cache coherency." Directory-based caching keeps a record of the state and location of every block of data in main memory. For every shareable memory address line, there is a presence bit for each coherent processor cache in the system. Whenever a processor requests a line of data from memory for its cache, the presence bit for that cache in that memory line is set. Whenever one of the processors attempts to write to that memory line, the presence bits are used to invalidate the cache lines of all the caches that previously used that memory line. All of the presence bits for the memory line are then reset and the specific presence bit is set for the processor that is writing to the memory line. Therefore, all of the processors do not have to reside on a common snoop bus because the directory maintains coherency for the individual processors.
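The presence-bit bookkeeping described above may be illustrated as follows. The `Directory` class and its method names are hypothetical, and the sketch tracks only the presence bits themselves, not the data or the messages that would carry the invalidations.

```python
class Directory:
    """Hypothetical directory: one presence bit per cache for each memory line."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.presence = {}            # line address -> list of presence bits

    def _bits(self, line):
        return self.presence.setdefault(line, [False] * self.num_caches)

    def record_read(self, line, cache_id):
        # A cache fetched the line: set that cache's presence bit.
        self._bits(line)[cache_id] = True

    def record_write(self, line, writer_id):
        # Caches whose bits are set must invalidate their copies.
        bits = self._bits(line)
        to_invalidate = [i for i, present in enumerate(bits)
                         if present and i != writer_id]
        # Reset every bit, then set only the writer's bit.
        self.presence[line] = [i == writer_id for i in range(self.num_caches)]
        return to_invalidate


directory = Directory(num_caches=4)
directory.record_read(0x40, 0)
directory.record_read(0x40, 2)
stale = directory.record_write(0x40, 1)   # caches 0 and 2 must invalidate
```

Because the directory names exactly which caches hold the line, only those caches need to be contacted, which is why no common snoop bus is required.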
In a typical multiprocessor architecture, the processors are interfaced to main memory by means of a memory controller which arbitrates the memory requests of the competing processors and transfers data between main memory and the processors. Often, the memory controller also interfaces a plurality of I/O devices to main memory and to the processors. For example, the memory controller may be coupled to a plurality of I/O bridges by means of an I/O bus. The I/O bridges provide an interface between the I/O bus and one or more external subsystems on the other side of the I/O bridge. In an exemplary system, the external I/O subsystems may be PCI subsystems, such as video cards and SCSI adapters and the like, that are plugged into expansion card slots.
The memory controller propagates requests to the I/O subsystem from the CPU bus down to the I/O bus, thereby allowing data to be sent to, or received from, an external I/O subsystem. Under normal circumstances, the I/O bridge accepts CPU-issued I/O requests that target the I/O subsystem and completes them in order. However, the I/O bridge may retry outbound CPU I/O requests because of I/O traffic. In these cases, the I/O bridge begins by opening a slot for the outbound CPU cycles. When multiple CPUs are contending to access the same I/O bridge, one or more CPU I/O requests may repeatedly steal the slot that the I/O bridge opened for a previous, different CPU I/O request. Contention with incoming I/O transactions, together with the ordering requirements for both outgoing and incoming transactions, requires the I/O bridge to retry outbound CPU-issued I/O requests until a slot can be made by throttling the inbound I/O transfers. If a sufficient number of CPU-issued I/O requests compete with sufficient incoming I/O transactions, both performance and fairness can be degraded, possibly to the point of indefinite starvation from a lack of forward progress. Such livelock conditions could eventually cause a multiprocessor system crash.
Therefore, there is a need in the art for improved multiprocessor systems that maintain forward progress from the CPU to the I/O bridge until a possible contention condition is detected. In particular, there is a need in the art for systems, circuits, and methods that are able to track the number of times a CPU-issued I/O request to an output I/O bridge is retried and force the I/O bridge to accept a CPU I/O request that has been repeatedly retried.
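The starvation problem and the stated need can be made concrete with a small simulation. This is a sketch under simplifying assumptions, not the mechanism of any particular system: the bridge is assumed to open exactly one slot per round, the first request to arrive is assumed to take it, and the function name `simulate` and the parameter `retry_limit` are hypothetical.

```python
def simulate(rounds, arrival_order, retry_limit=None):
    """Model CPUs contending for one bridge slot per round.

    arrival_order: CPU ids in the order their requests reach the bridge.
    retry_limit:   assumed threshold; a request retried this many times
                   is forced through ahead of the others. None disables it.
    Returns (accepted count per CPU, worst consecutive retries per CPU).
    """
    retries = {cpu: 0 for cpu in arrival_order}
    accepted = {cpu: 0 for cpu in arrival_order}
    worst = {cpu: 0 for cpu in arrival_order}
    for _ in range(rounds):
        # Without tracking, the first request to arrive takes the slot.
        winner = arrival_order[0]
        if retry_limit is not None:
            starved = [c for c in arrival_order if retries[c] >= retry_limit]
            if starved:
                # Force acceptance of the most-retried request.
                winner = max(starved, key=lambda c: retries[c])
        for cpu in arrival_order:
            if cpu == winner:
                accepted[cpu] += 1
                retries[cpu] = 0
            else:
                retries[cpu] += 1     # bridge retries this request
                worst[cpu] = max(worst[cpu], retries[cpu])
    return accepted, worst


# CPU 1 always reaches the bridge first and steals the slot from CPU 0.
unfair_accepted, unfair_worst = simulate(12, [1, 0])
fair_accepted, fair_worst = simulate(12, [1, 0], retry_limit=3)
```

With no retry tracking, CPU 0 is never accepted in this model and its retry count grows without bound. With the assumed limit of three retries, CPU 0 is accepted periodically and its consecutive retries never exceed the limit.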