Modern computer systems, including servers, workstations, and the like, typically have several devices, such as input/output ("I/O" or "IO") units that can cache lines of memory, or central processor units ("CPUs") or microprocessors, and associated distributed CPU random access memory units ("RAM"). (As used herein, the terms "CPU" and "device" shall be used interchangeably.) The various devices can communicate with each other and with RAM through one or more buses that can carry various information, including requests, commands, memory addresses, data, and the like, between the various devices. Such information is transmitted in packets over a bus that is typically many bits wide, for example 64 bits or eight bytes, at a transmission rate that is affected by the system clock frequency.
The main memory of a computer system usually has a large storage capacity but is relatively slow in accessing data. To achieve faster access to data, and to access main memory less frequently, many devices (and especially CPUs) have a small, fast local memory called a cache. The cache is used to store a copy of frequently and recently used data so that the device can access the cache instead of the main memory.
Several techniques are known in the art whereby a device writes data to a memory location that is in its cache. In a so-called "write through" cache, the data is written to the cache as well as to main memory. Alternatively, in a "write back" (or "copy back") cache, the data is written to the cache only. In a write back cache, the data in main memory is "stale" in that it is no longer correct, and only the cache holds a correct copy of the memory location. The modified copy of data in the cache is called "dirty". When the dirty data must be removed from the cache (to make room for a copy of a different memory location), the dirty data must be written back to memory. Although the present invention is described with respect to a computer system using write back caches, the invention could also be generalized for use with write through caches.
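By way of illustration only, the difference between the two write policies may be sketched as follows. The sketch is a simplified software model, not the described hardware; the class name Cache, the dict-backed memory, and the evict method are hypothetical names introduced solely for this example.

```python
class Cache:
    """Toy cache over a dict-backed main memory (illustrative assumption)."""

    def __init__(self, memory, write_back=True):
        self.memory = memory          # main memory: address -> value
        self.write_back = write_back  # False selects write through behavior
        self.lines = {}               # cached copies: address -> value
        self.dirty = set()            # addresses modified only in the cache

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)       # main memory is now stale for addr
        else:
            self.memory[addr] = value  # write through: memory updated as well

    def evict(self, addr):
        # A dirty line must be written back to memory before it is dropped.
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


memory = {0x10: 1}
wb = Cache(memory, write_back=True)
wb.write(0x10, 2)
stale_value = memory[0x10]   # memory still holds the old value
wb.evict(0x10)
written_back = memory[0x10]  # eviction copied the dirty data to memory
```

In this model the write back cache leaves main memory untouched until eviction, whereas constructing the cache with write_back=False updates main memory on every write.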
Understandably, cache coherence is important. If multiple devices have local copies of the same memory location in their caches, correct system operation dictates that all devices must observe the same data in their caches (since they are meant to hold copies of the same memory location). But if one or more devices write to their local copy of the data in their caches, all devices may no longer observe the same data. Cache coherence is the task of ensuring that all devices observe the same data in their caches for the same memory location. This is done by updating copies of the data in all other caches, or by deleting copies of the data in all other caches, when any device modifies data in its cache. Although the present invention is described with respect to a system using the second type of cache coherence (deleting copies in other caches), either type of coherence may in fact be used. Note that if write back caches are used, when a device wants a copy of a memory location that is dirty in another cache, the data must be obtained from the cache with the dirty data, and not from memory (since the data in memory is stale).
A so-called snooping protocol is a common technique for implementing cache coherence. Each cache maintains the state for each of the memory locations in the cache. When a device wishes to read or write a memory location, it broadcasts its request, usually over a bus. That request is observed and checked against the state by all devices, i.e., the request is "snooped". For read requests, caches with dirty copies, rather than memory, respond with the data. For write requests, all other caches invalidate or update their copy of the data.
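By way of illustration only, a snooping protocol of the invalidating kind may be modeled as follows. This is a minimal sketch under stated assumptions: each cache tracks only a "clean" or "dirty" state per address, the bus is a simple loop over all caches, and the names SnoopingCache, Bus, snoop_read, and snoop_write are hypothetical and do not come from any described implementation.

```python
class SnoopingCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> (state, value); state is "clean" or "dirty"

    def snoop_read(self, addr):
        """Return the data if this cache holds a dirty copy, else None."""
        state, value = self.lines.get(addr, (None, None))
        return value if state == "dirty" else None

    def snoop_write(self, addr):
        """Invalidate any local copy when another device writes addr."""
        self.lines.pop(addr, None)


class Bus:
    def __init__(self, memory, caches):
        self.memory = memory  # main memory: address -> value
        self.caches = caches  # every cache observes every request

    def read(self, requester, addr):
        # All other caches snoop the request; a dirty owner supplies the data.
        value = None
        for cache in self.caches:
            if cache is not requester:
                supplied = cache.snoop_read(addr)
                if supplied is not None:
                    value = supplied
        if value is None:
            value = self.memory[addr]  # no dirty copy exists; memory is correct
        requester.lines[addr] = ("clean", value)
        return value

    def write(self, requester, addr, value):
        # All other caches invalidate their copies of the written location.
        for cache in self.caches:
            if cache is not requester:
                cache.snoop_write(addr)
        requester.lines[addr] = ("dirty", value)


memory = {0x40: 7}
a, b = SnoopingCache("A"), SnoopingCache("B")
bus = Bus(memory, [a, b])
first = bus.read(a, 0x40)          # served from memory
bus.write(b, 0x40, 9)              # A's copy is invalidated; memory is stale
invalidated = 0x40 not in a.lines
second = bus.read(a, 0x40)         # served by B's dirty copy, not by memory
```

In this model the second read by cache A returns the value written by cache B even though main memory still holds the older value.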
Transactions usually involve a request with the address followed by a response with the data. In so-called "circuit switched" buses, a transaction has to complete before a subsequent transaction can start. If there is a long delay between the request and the response, the bus remains idle for the duration of the delay, with a resultant loss of bus bandwidth. By contrast, so-called "split transaction" (or "packet switched") buses allow requests and responses for other transactions to occur between the request and the response for a given transaction. This allows the full bandwidth of the bus to be utilized, even if there are delays between the request and the response for a given transaction.
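By way of illustration only, the bandwidth difference between the two bus types may be estimated with a toy cycle-count model. The model assumes, purely for the example, that each request and each response occupies the bus for one cycle and that every response becomes ready a fixed number of cycles after its request; the function names are hypothetical.

```python
from collections import deque


def circuit_switched_cycles(n, latency):
    """Each transaction holds the bus for request + idle latency + response."""
    return n * (1 + latency + 1)


def split_transaction_cycles(n, latency):
    """Requests for other transactions may use the bus while responses are pending."""
    ready = deque()  # cycles at which pending responses become ready, in order
    issued = completed = cycle = 0
    while completed < n:
        if ready and ready[0] <= cycle:
            ready.popleft()                    # bus carries a ready response
            completed += 1
        elif issued < n:
            ready.append(cycle + 1 + latency)  # bus carries a new request
            issued += 1
        # otherwise the bus is idle this cycle
        cycle += 1
    return cycle


circuit = circuit_switched_cycles(4, 6)  # 4 transactions, 6 idle cycles each
split = split_transaction_cycles(4, 6)
```

Under these assumptions, four transactions with a six-cycle delay occupy 32 cycles on the circuit switched bus and 11 cycles on the split transaction bus; with zero delay the two counts are equal.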
A CPU wishing to read data from a memory location, or to write data to a memory location, typically will first broadcast a request-type signal to the system over a system bus. However, other devices may also need to broadcast the same type of signal at the same time over the bus. Since only one signal value at a time may be transmitted on the bus, the devices must arbitrate for the use of the bus, and a mechanism implementing arbitration is provided. Further, the common system bus that carries these requests and the data and other signals is a finite resource, whose transmission bandwidth is fixed by the number of bit lines and the system clock rate.
Several arbitration mechanisms are known in the art. In the so-called fair algorithm, the arbitrator grants bus access to CPUs in the order the requests arrive, i.e., access is granted to the CPU whose request has been pending the longest time. No priority of importance is assigned to the individual CPU requests, and the sole criterion is the time order of the various requests. Unfortunately, this algorithm requires substantial state and logic depth, and is difficult to implement.
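By way of illustration only, the grant order of the fair algorithm may be modeled in software with a first-in, first-out queue. The sketch shows only the ordering behavior, not the hardware state and logic depth noted above; the class name FifoArbiter and its methods are hypothetical.

```python
from collections import deque


class FifoArbiter:
    """Grants the bus to the CPU whose request has been pending longest."""

    def __init__(self):
        self.pending = deque()  # CPU identifiers in order of request arrival

    def request(self, cpu):
        # A CPU with a request already pending keeps its original position.
        if cpu not in self.pending:
            self.pending.append(cpu)

    def grant(self):
        """Return the longest-waiting CPU, or None if nothing is pending."""
        return self.pending.popleft() if self.pending else None


arbiter = FifoArbiter()
for cpu in (2, 0, 1, 0):  # the repeated request from CPU 0 is ignored
    arbiter.request(cpu)
grants = [arbiter.grant() for _ in range(4)]
```

Here the grants are issued strictly in arrival order, regardless of CPU number, and the final call returns None because no request remains pending.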
Another prior art method is the so-called round robin algorithm, in which a cyclic order is defined among CPUs such that the identity of the most favored requestor-CPU moves. Thus, if CPU N received the most recent arbitration grant, then CPU N+1 has the highest priority if CPU N+1 asserts a request. While round-robin algorithms have found favor with computer system architects, such algorithms are difficult to implement with a small logic depth if the number of competitors is large. Further, a deep, hierarchical round-robin scheme takes too many clock cycles to determine the grant winner because winners must be determined at each of a plurality of lower levels, after which a winner is determined from among the lower level winners.
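By way of illustration only, a single-level round robin selection may be sketched as follows. The sketch assumes the request lines are presented as a list of booleans indexed by CPU number; the function name round_robin_grant and its parameters are hypothetical, and the hierarchical variant is not modeled.

```python
def round_robin_grant(requests, last_granted):
    """Return the next requesting CPU after last_granted in cyclic order.

    requests     -- list of booleans, requests[i] is True if CPU i requests
    last_granted -- index of the CPU that received the most recent grant
    """
    n = len(requests)
    # Start at CPU last_granted + 1 and wrap around; the most recent winner
    # is examined last, so it has the lowest priority.
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None  # no CPU is requesting the bus


requests = [True, False, True, True]
next_after_2 = round_robin_grant(requests, last_granted=2)
next_after_3 = round_robin_grant(requests, last_granted=3)
next_after_0 = round_robin_grant(requests, last_granted=0)
```

With CPUs 0, 2, and 3 requesting, the grant passes from CPU 2 to CPU 3, then wraps to CPU 0, then skips the idle CPU 1 and returns to CPU 2.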
Yet another prior art algorithm prioritizes the various CPUs statically. Thus, CPU 0 is permanently assigned the highest priority, CPU 1 is assigned the next highest priority, and so on. As a result, CPU 2 cannot receive an arbitration grant unless neither CPU 0 nor CPU 1 is presently requesting bus access. This static prioritization scheme has the advantage of being especially easy to implement.
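By way of illustration only, static prioritization reduces to selecting the lowest-numbered requesting CPU. The sketch assumes the request lines are presented as a list of booleans indexed by CPU number; the function name static_priority_grant is hypothetical.

```python
def static_priority_grant(requests):
    """Return the lowest-numbered requesting CPU, or None if none requests.

    requests -- list of booleans, requests[i] is True if CPU i requests;
                CPU 0 has the highest fixed priority.
    """
    for cpu, asserted in enumerate(requests):
        if asserted:
            return cpu
    return None


only_cpu2 = static_priority_grant([False, False, True])   # CPU 2 wins
cpu0_and_cpu2 = static_priority_grant([True, False, True])  # CPU 0 wins
cpu1_and_cpu2 = static_priority_grant([False, True, True])  # CPU 1 wins
```

As in the text, CPU 2 is selected only in the case where neither CPU 0 nor CPU 1 asserts a request.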
Using any of the above techniques, while the requesting CPU that receives a bus access grant from the arbitrator is gaining access, the requests from any other requesting CPUs must wait in a blocked or pending state. The blocked or pending state continues until the requesting CPU receives its arbitration grant, places its data or desired address or other signal on the bus and thus completes its transaction.
In the prior art, regardless of the mechanism used to arbitrate competing requests for access to the bus, one arbitration line would be used for data, and a second arbitration line would be used for addresses. As noted, during the time of arbitration, grant, and access by the CPU winning the grant, other pending requests are temporarily blocked pending completion of the first granted request.
While such techniques work, the latency penalty can be excessive in that many clock cycles must pass between a first CPU request, a grant of arbitration to that CPU, CPU access to the bus, and then a grant of arbitration and bus access to the next requestor to receive bus access.
There is a need for a method and apparatus for arbitrating access to a bus in which blocking mechanisms are not used, and in which a minimal latency time is achieved.
The present invention provides such an arbitration method and apparatus.