Modern computer systems, including servers, workstations, and the like, typically have several devices such as input/output ("I/O" or "IO") units that can cache lines of memory, or central processor units ("CPUs") or microprocessors, and associated distributed CPU random access memory units ("RAM"). (As used herein, the terms "CPU" and "device" shall be used interchangeably.) The various devices can communicate with each other and with RAM through one or more buses that can carry various information including requests, commands, memory addresses, data, and the like between the various devices. Such information is transmitted in packets over a bus line that is typically many bits wide, for example 64 bits (eight bytes), at a transmission rate that is affected by the system clock frequency.
The main memory of a computer system usually has large storage capacity but is relatively slow in accessing data. To achieve faster access to data, and to access main memory less frequently, many devices (and especially CPUs) have a small, fast local memory called a cache. The cache is used to store a copy of frequently and recently used data so that the device can access the cache instead of the main memory.
Several techniques are known in the art whereby a device writes data to a memory location that is in its cache. In a so-called "write through" cache, the data can be written to the cache as well as to main memory. Alternatively, in a "write back" (or "copy back") cache, the data can be written to the cache only. In a write back cache, the data in main memory is "stale" in that it is no longer correct, and only the cache holds a correct copy of the memory location. The modified copy of data in the cache is called "dirty". When the dirty data must be removed from the cache (to make room for a copy of a different memory location), the dirty data must be written back to memory. Although the present invention is described with respect to a computer system using write back caches, the invention could also be generalized for use with write through caches.
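The difference between the two write policies described above can be illustrated with a minimal sketch. The class name `Cache`, its fields, and the use of a dictionary as main memory are assumptions made purely for illustration and are not taken from any particular system.

```python
class Cache:
    """Toy cache in front of a dict that stands in for main memory (hypothetical)."""

    def __init__(self, memory, write_back):
        self.memory = memory          # backing store: address -> value
        self.write_back = write_back  # True: write back, False: write through
        self.lines = {}               # address -> value held in the cache
        self.dirty = set()            # addresses modified but not yet written back

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)       # main memory copy is now stale
        else:
            self.memory[addr] = value  # write through: memory updated at once

    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]  # dirty data written back
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


# Write through: memory is updated together with the cache.
memory_wt = {0x10: 1}
wt = Cache(memory_wt, write_back=False)
wt.write(0x10, 2)

# Write back: memory stays stale until the dirty line is evicted.
memory_wb = {0x10: 1}
wb = Cache(memory_wb, write_back=True)
wb.write(0x10, 2)
stale_value = memory_wb[0x10]  # still the old value at this point
wb.evict(0x10)
```

In this sketch, the write back cache leaves main memory holding the old value until `evict` is called, which is the stale condition the paragraph above describes.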
Understandably, cache coherence is important. If multiple devices have local copies of the same memory location in their caches, correct system operation dictates that all devices must observe the same data in their caches (since they are meant to hold copies of the same memory location). But if one or more devices write to their local copy of the data in their caches, all devices may no longer observe the same data. Cache coherence is the task of ensuring that all devices observe the same data in their caches for the same memory location. This is done by updating copies of the data in all other caches, or by deleting copies of the data in all other caches, when any device modifies data in its cache. Although the present invention is described with respect to a system using the second type of cache coherence, either type of coherence may in fact be used. Note that if write back caches are used, when devices want a copy of a memory location that is dirty in another cache, the data must be obtained from the cache with the dirty data, and not from memory (since the data in memory is stale).
A so-called snooping protocol is a common technique for implementing cache coherence. Each cache maintains the state for each of the memory locations in the cache. When a device wishes to read or write a memory location, it broadcasts its request, usually over a bus. That request is observed and checked against the state by all devices, i.e., the request is "snooped". For read requests, a cache holding a dirty copy, rather than memory, responds with the data. For write requests, all other caches invalidate or update their copy of the data.
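The snooping behavior described above can be sketched as follows, using the invalidate form of coherence. The three states ("I" invalid, "S" shared, "M" modified), the class names `Bus` and `SnoopingCache`, and the choice to have a dirty owner also update main memory when it supplies data are all simplifying assumptions for illustration, not a description of any specific protocol.

```python
class Bus:
    """Hypothetical broadcast bus: every request is seen by all attached caches."""

    def __init__(self, memory):
        self.memory = memory  # address -> value
        self.caches = []

    def read(self, requester, addr):
        # Other caches snoop the read; a cache holding a dirty copy supplies
        # the data (modeled here by writing it back before memory is read).
        for c in self.caches:
            if c is not requester and c.state.get(addr) == "M":
                self.memory[addr] = c.data[addr]
                c.state[addr] = "S"
        return self.memory[addr]

    def invalidate(self, requester, addr):
        # Other caches snoop the write and drop their copies.
        for c in self.caches:
            if c is not requester and c.state.get(addr, "I") != "I":
                if c.state[addr] == "M":
                    self.memory[addr] = c.data[addr]
                c.state[addr] = "I"
                c.data.pop(addr, None)


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.state = {}  # address -> "I", "S", or "M"
        self.data = {}   # address -> cached value
        bus.caches.append(self)

    def read(self, addr):
        if self.state.get(addr, "I") == "I":
            self.data[addr] = self.bus.read(self, addr)
            self.state[addr] = "S"
        return self.data[addr]

    def write(self, addr, value):
        if self.state.get(addr, "I") != "M":
            self.bus.invalidate(self, addr)
        self.data[addr] = value
        self.state[addr] = "M"  # this cache now holds the only valid copy


memory = {0x40: 7}
bus = Bus(memory)
a, b = SnoopingCache(bus), SnoopingCache(bus)
first_a, first_b = a.read(0x40), b.read(0x40)  # both caches share the line
a.write(0x40, 9)                               # b's copy is invalidated
b_state_after_write = b.state[0x40]
memory_after_write = memory[0x40]              # memory is stale
b_reread = b.read(0x40)                        # data comes from a's dirty copy
```

After the write by `a`, main memory still holds the old value, so the subsequent read by `b` must be satisfied from the dirty copy in `a`, as the text notes for write back caches.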
Transactions usually involve a request with the address followed by a response with the data. In so-called "circuit switched" buses, a transaction has to complete before a subsequent transaction can start. If there is a long delay between the request and the response, the bus remains idle for the duration of the delay, with resultant loss of bus bandwidth. By contrast, so-called "split transaction" (or "packet switched") buses allow requests and responses for other transactions to occur in between the request and response for a given transaction. This allows the full bandwidth of the bus to be utilized, even if there are delays between the request and the response for a given transaction.
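The bandwidth difference between the two bus types can be estimated with an idealized model. The sketch below assumes that a request and a response each occupy the bus for one cycle, that every response becomes ready a fixed number of cycles after its request, and that responses take priority over new requests; these timing rules and the function names are hypothetical.

```python
from collections import deque


def circuit_switched_cycles(n, delay):
    """Cycles to finish n transactions when each must complete before the next."""
    # One request cycle, `delay` idle cycles, one response cycle per transaction.
    return n * (1 + delay + 1)


def split_transaction_cycles(n, delay):
    """Cycles to finish n transactions when requests may fill the idle gaps."""
    ready = deque()  # cycles at which outstanding responses become ready (FIFO)
    issued = done = cycle = 0
    while done < n:
        if ready and ready[0] <= cycle:
            ready.popleft()                  # bus carries a response this cycle
            done += 1
        elif issued < n:
            ready.append(cycle + 1 + delay)  # bus carries a new request
            issued += 1
        # Otherwise the bus is idle this cycle.
        cycle += 1
    return cycle


circuit = circuit_switched_cycles(4, 3)
split = split_transaction_cycles(4, 3)
```

Under these assumptions, four transactions with a three-cycle delay occupy the circuit switched bus for 20 cycles, of which only 8 carry information, while the split transaction bus completes the same work in 8 cycles.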
A CPU wishing to read data from a memory location, or to write data to a memory location, typically will first broadcast a request-type signal to the system over a system bus. However, other devices may also need to broadcast the same signal at the same time over the bus. Since only one signal value may be transmitted on the bus at a time, the devices must arbitrate for the use of the bus, and a mechanism implementing such arbitration is provided.
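As one illustration of how competing requests for the bus might be resolved, the sketch below uses a round-robin policy. The text does not specify an arbitration policy, so round-robin is an assumption chosen only for simplicity, and the function name `round_robin_grant` is hypothetical.

```python
def round_robin_grant(requests, last_granted):
    """Return the index of the device granted the bus, or None if none requests.

    requests     -- list of booleans, one per device, True if requesting the bus
    last_granted -- index of the device that most recently held the bus
    """
    n = len(requests)
    # Search starting just after the last winner so every device gets a turn.
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None


# Devices 0, 2, and 3 request the bus in every round; device 0 held it last.
requests = [True, False, True, True]
grants = []
last = 0
for _ in range(3):
    last = round_robin_grant(requests, last)
    grants.append(last)
```

Only one device is granted the bus in each round, and the rotating start point keeps any single requester from monopolizing it.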
The common system bus that carries these requests and the data and other signals is a finite resource, whose transmission bandwidth is fixed by the number of bit lines and system clock rate. Even with a rapid mechanism to arbitrate potentially conflicting requests and grant access requests, maximizing bus system throughput and response is a challenge. For example, prior art arbitration schemes impose a latency penalty of two clock cycles or more.
Prior art systems are complex due to the necessity of dealing with multiple transactions involving a common address. To resolve the resulting ambiguities, such systems must define "pending" or "transient" states, which contribute further complexity to the overall implementation. Prior art attempts to impose flow control and avoid collision ambiguities in such systems are also cumbersome.
In some systems where a data request is not completed immediately following the request, complicated mechanisms must be employed to ensure that the request is ultimately completed. In a system in which memory is distributed, it is challenging to rapidly maintain a coherent domain, i.e., memory space that is always maintained coherent. A transaction request to read data from a memory location that presently holds invalid data cannot rapidly be completed in the prior art. First the memory location must be rewritten with valid data, and then the valid data can be provided to the requestor. Prior art procedures to implement these processes in a snooping split transaction bus system are complex and time consuming.
The architecture for a split transaction snooping bus system preferably should lend itself to use in a system requiring several such bus systems, such as a multiple workstation network. In a computer system comprising a single bus system, the order in which transactions are placed on the address bus determines an absolute temporal relationship. Thus, if a transaction initiated by CPU A appears on the bus before a transaction initiated by CPU B, the computer system irrevocably regards transaction A as preceding transaction B. Unfortunately, such simplistic assumptions no longer hold in a system that includes a plurality of such computer systems, with a plurality of bus systems. One such example might be a network comprising at least two workstations.
In a sub-computer system having a single bus system, a unique order of transactions may be defined by the temporal order in which address packets appear on the address bus within the bus system. However, in a system comprising a plurality of such sub-systems and having a plurality of bus systems, it is both necessary and extremely difficult to define a global order for transactions. For example, a CPU in sub-system 1 may wish to write data to a memory location that could be in any sub-system, including sub-system 1. At precisely the same time, a CPU in another sub-system might wish to write data to the same or another memory location. How, then, is a global ordering to be defined between these two simultaneous transactions?
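The ordering problem described above can be made concrete with a small sketch. It assumes, purely for illustration, that each address packet records the bus that carried it and the cycle in which it appeared, and that cycles on different buses are comparable; the names `AddressPacket`, `local_order`, and `unordered_pairs` are hypothetical. The sketch only exposes the ambiguity and does not attempt to resolve it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressPacket:
    bus_id: int  # which sub-system's bus carried the packet
    cycle: int   # bus cycle in which the packet appeared
    source: str  # issuing device
    addr: int    # target memory address


def local_order(packets, bus_id):
    """Within one bus, appearance order on the address bus gives a unique order."""
    return sorted((p for p in packets if p.bus_id == bus_id),
                  key=lambda p: p.cycle)


def unordered_pairs(packets):
    """Pairs issued in the same cycle on different buses: no bus defines an order."""
    return [(a, b)
            for i, a in enumerate(packets)
            for b in packets[i + 1:]
            if a.bus_id != b.bus_id and a.cycle == b.cycle]


packets = [
    AddressPacket(bus_id=1, cycle=5, source="CPU A", addr=0x80),
    AddressPacket(bus_id=1, cycle=6, source="CPU B", addr=0x80),
    AddressPacket(bus_id=2, cycle=5, source="CPU C", addr=0x80),
]
bus1_sources = [p.source for p in local_order(packets, 1)]
ambiguous = [(a.source, b.source) for a, b in unordered_pairs(packets)]
```

Within bus 1 the order of CPU A before CPU B is unambiguous, whereas the simultaneous packets from CPU A and CPU C on different buses have no order that either bus can supply on its own.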
The resultant uncertainty can create problems in executing routines in which transaction order may be critical. Further, the inability to effectively define a global transaction order in the prior art for such systems can also result in system deadlock.
What is needed is a mechanism and procedure for such systems to create and optimize a global ordering of data replies.
The present invention provides such a procedure and mechanism for optimizing global data replies in such systems.