A computer system 10 is illustrated in FIG. 1. The computer system 10 comprises processors or central processing units (CPUs) 12-1 to 12-n, cache memories 13-1 to 13-n, an I/O device 14, such as a disk memory, a system bus 16, an I/O bridge 17, a main memory 18 and a memory controller 19. Each of these devices 12-1 to 12-n, 13-1 to 13-n, 14, 15, 16, 17, 18 and 19 is explained below.
The main memory 18 is for storing data and is typically formed by one or more dynamic random access memory integrated circuits (DRAMs). Such DRAM main memories 18 are relatively inexpensive. The main memory 18 typically has a memory array of storage locations. Each storage location can store a data word of a fixed length, e.g., eight bit long or byte long data words. Each storage location has a unique identifier, called an address, which is used in data access, i.e., read and write, commands for specifying the particular storage location from which data should be read or into which data should be written. Illustratively, the storage locations are further organized into data line storage locations for storing fixed length (e.g., thirty-two byte long), non-overlapping, contiguous blocks of data called data lines. Each data line storage location has a unique line address, similar to the aforementioned addresses, for specifying a particular data line storage location in data line accesses (i.e., reading and writing data lines). The purpose of the data lines is described below.
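Illustratively, the relationship between a byte address and a line address can be sketched as follows. This is a minimal sketch only: the thirty-two byte line length is the example given above, while the function name `split_address` and the assumption that data lines are aligned (so that the line address is the byte address divided by the line length) are hypothetical and not taken from the description.

```python
LINE_SIZE = 32  # bytes per data line (the illustrative length given above)


def split_address(byte_address: int) -> tuple[int, int]:
    """Return (line_address, offset_within_line) for a byte address.

    Assumption: data lines are aligned and non-overlapping, so the line
    address is simply the byte address divided by the line length.
    """
    return byte_address // LINE_SIZE, byte_address % LINE_SIZE


# Byte addresses 64 to 95 all fall in the same data line (line address 2).
print(split_address(64))  # (2, 0)
print(split_address(95))  # (2, 31)
print(split_address(96))  # (3, 0)
```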
The various devices 12-1 to 12-n, 13-1 to 13-n, 17 and 19 making up computer system 10 are interconnected by a system bus 16. The system bus 16 is for transferring information in the form of data, addresses and commands between devices. The system bus 16 can be viewed as comprising a set of wires that transmit signals which are interpreted as commands and data by the various devices. The system bus 16 includes a data bus for transferring data, a command bus for transferring commands and addresses, and an arbitration bus. The system bus 16 is a shared resource; each of the devices attempts to utilize the system bus at one time or another for purposes of transferring data, addresses or commands. Sometimes, more than one device contends to utilize the system bus 16 at the same time. However, illustratively, only a limited number of devices can utilize the system bus 16 at one time. For instance, only one device may transmit data on the data bus at one time (although it may be possible for two devices to accept data from the data bus simultaneously). To resolve this contention, the computer system 10 is provided with an elaborate arbitration protocol for allocating the system bus 16, in a fair and orderly manner, to each device contending to use it. Illustratively, data is transmitted on the system bus 16 in 16 byte packets at a clock speed of 33 MHz.
The CPUs 12-1 to 12-n are for executing program instructions. Examples of instructions are: arithmetic or logical operations on data; program flow control instructions for ordering the execution of other instructions; and memory access commands. In the course of executing these instructions, the processors 12-1 to 12-n may issue data access, i.e., data read and data write, commands. The program instructions themselves are stored as data in the main memory 18. Illustratively, the CPUs may be microprocessors such as Intel®'s Pentium™ or Motorola®'s Power PC 604™ with a clock speed of up to 133 MHz.
The cache memories 13-1 to 13-n are small, relatively high speed memories for maintaining a duplicate copy of data stored in the shared main memory 18. Cache memories are typically formed by high speed static random access memory integrated circuits (SRAMs). For instance, note that the DRAMs of the main memory 18 may have a data access clock speed of no more than about 16.7 MHz whereas the SRAMs of the cache memories 13-1 to 13-n may have a data access clock speed of 30 MHz up to the clock speed of the CPU chips (assuming the cache memory 13-1 to 13-n is incorporated onto the same chip as the CPUs 12-1 to 12-n). The cache memories 13-1 to 13-n may also be part of the integrated circuit of the CPUs 12-1 to 12-n. As cache memory is considerably more expensive than the slower main memory, the size of each cache memory 13-1 to 13-n is typically much smaller than the size of the main memory 18. Despite their relatively small size in comparison to the main memory 18, the cache memories dramatically reduce the need to access data from the main memory 18. This is because cache memories 13-1 to 13-n exploit temporal and spatial locality of reference properties of processor data accesses. Temporal locality of reference is the tendency of processors 12-1 to 12-n to access the same data over and over again. The temporal property arises from program flow control instructions such as loops, branches and subroutines which cause the processors 12-1 to 12-n to repeat execution of certain recently executed instructions. Spatial locality of reference refers to the tendency of processors to access data having addresses near the addresses of other recently accessed data. The spatial property arises from the sequential nature of program instruction execution, i.e., the processor tends to execute instructions in the sequential order in which they are stored as data. In addition, memory references to non-instruction data tend to be localized to a lesser degree. 
For instance, non-instruction data tends to be stored in tables, arrays and frequently and repeatedly accessed variables. Thus, the CPUs 12 tend to access repeatedly the data stored in the same localities in memory.
In order to exploit the locality of reference property, cache memories typically store an entire data line corresponding to recently accessed data. Consequently, the likelihood increases that the cache memories 13-1 to 13-n can satisfy future accesses to data not yet accessed (assuming that future accesses will be to other data corresponding to the data lines already stored in the cache memories 13-1 to 13-n).
The cache memories 13-1 to 13-n work as follows. When the corresponding processor, e.g., the processor 12-1, issues a data access command, the associated cache memory 13-1 determines if it possesses the data line that contains the particular desired data. If so, a read or write (depending upon whether the processor issued a read or write command) "hit" is said to occur and the cache memory 13-1 satisfies the processor data access using the copy of the data within the cache memory 13-1. If the cache memory 13-1 does not contain the designated data, a read or write "miss" is said to occur. In the event of a read or write miss, the cache memory 13-1 issues a command for reading the data line corresponding to the designated address from the main memory 18 via the system bus 16. In response to receiving the read command, the main memory 18 retrieves the data line stored therein at the particular line address and transfers this retrieved data line via the system bus 16 to the cache memory 13-1. The cache memory 13-1 stores the data line transferred from the main memory 18 and then continues as if the appropriate data line were already present in the cache memory 13-1.
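The hit and miss behavior described above can be sketched, for the read case only, as follows. This is a hedged illustration rather than the actual cache design: the class name `SimpleCache`, the dictionary standing in for the main memory 18, and the assumption of unlimited cache capacity (no line is ever evicted) are all hypothetical.

```python
LINE_SIZE = 32  # bytes per data line (illustrative)


class SimpleCache:
    """Toy cache holding whole data lines keyed by line address.

    Assumptions: unlimited capacity and read accesses only.
    """

    def __init__(self, main_memory: dict[int, bytes]):
        self.main_memory = main_memory  # line address -> data line
        self.lines: dict[int, bytearray] = {}
        self.hits = 0
        self.misses = 0

    def read(self, byte_address: int) -> int:
        line_address, offset = divmod(byte_address, LINE_SIZE)
        if line_address in self.lines:
            self.hits += 1  # the data line is already present
        else:
            self.misses += 1
            # On a miss, fetch the entire data line from main memory.
            self.lines[line_address] = bytearray(self.main_memory[line_address])
        return self.lines[line_address][offset]


# Four data lines; every byte of line n holds the value n.
memory = {n: bytes([n] * LINE_SIZE) for n in range(4)}
cache = SimpleCache(memory)
for address in (0, 1, 2, 31, 32, 33):
    cache.read(address)
print(cache.hits, cache.misses)  # 4 2
```

In this sketch, only the first access to each data line misses; the neighboring addresses are then served from the cached line, which is the spatial locality effect described earlier.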
Cache memories 13-1 to 13-n must maintain the consistency of the data in the main memory 18. That is, while a cache memory 13-1 to 13-n may modify its copy of the data, the counterpart copy of that data in the main memory 18 must invariably be modified accordingly. According to one memory-consistent manner of operating a cache memory (e.g., the cache memory 13-1), called write through, the cache memory 13-1 immediately attempts to update the counterpart copy in the main memory 18 whenever the processor 12-1 modifies the cache memory's 13-1 copy of the data. This manner of operating the cache memory 13-1 is disadvantageous because the cache memory 13-1 must continually use the system bus 16 to access the main memory 18 each time the associated processor 12-1 modifies the data.
In order to reduce the demands on the slow main memory 18 and system bus 16, the cache memories 13-1 to 13-n operate in a manner called "write back." According to this manner of operation, each cache memory 13-1 to 13-n defers updating or writing back the modified data line until a later time. For instance, if the cache memory, e.g., the cache memory 13-1, runs out of storage space, the cache memory 13-1 may write back a modified data line to provide an available storage space for an incoming data line. Alternatively, as described in greater detail below, the cache memory 13-1 may write back a data line when another device attempts to read that data line.
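The difference in bus traffic between the write through and write back manners of operation can be illustrated with a rough count. The sketch below rests on simplifying assumptions that are not part of the description: the cache is large enough to hold every modified line, and each modified line is written back exactly once. The function names and the example addresses are hypothetical.

```python
def bus_writes_write_through(write_addresses: list[int]) -> int:
    """Write through: every processor write is sent to main memory."""
    return len(write_addresses)


def bus_writes_write_back(write_addresses: list[int], line_size: int = 32) -> int:
    """Write back: each modified data line is written back once.

    Assumption: no modified line is evicted and modified again, so the
    number of write backs equals the number of distinct modified lines.
    """
    return len({address // line_size for address in write_addresses})


# Six processor writes that touch only two data lines.
writes = [0, 4, 8, 0, 36, 40]
print(bus_writes_write_through(writes))  # 6
print(bus_writes_write_back(writes))     # 2
```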
The I/O bridge 17 interconnects the system bus 16 and I/O expansion bus 16'. One or more I/O devices 14, such as Ethernet interfaces, FDDI interfaces, SCSI interfaces, disk drives, etc., are connected to the I/O expansion bus 16'.
The purpose of the I/O bridge 17 is to "decouple" the system bus 16 and the I/O expansion bus 16'. Typically, data is transmitted in different formats and at different speeds on these two busses 16 and 16'. For instance, data may be transmitted in sixteen byte packets on the system bus 16 at 33 MHz while data is transmitted in four byte groups at 8 MHz on the I/O expansion bus 16'. The I/O bridge 17 may receive data packets from a device, e.g., the processor 12-1, connected to the system bus 16, and temporarily store the data of these packets therein. The I/O bridge 17 then transmits the received, "depacketized" data in four byte groups to an I/O device 14 on the I/O expansion bus 16'. Likewise, the I/O bridge 17 may receive and temporarily store data from an I/O device 14 via the I/O expansion bus 16'. The I/O bridge 17 then transmits the received data in packets to a device, e.g., the main memory 18, connected to the system bus 16.
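The "depacketizing" performed by the I/O bridge 17 can be sketched as a simple regrouping of bytes. The sixteen byte packet and four byte group sizes are the illustrative values given above; the function names `depacketize` and `packetize` are hypothetical, and the sketch ignores the differing clock speeds and any buffering policy.

```python
def depacketize(packet: bytes, group_size: int = 4) -> list[bytes]:
    """Split a system bus packet into groups for the I/O expansion bus."""
    if len(packet) % group_size:
        raise ValueError("packet length must be a multiple of the group size")
    return [packet[i:i + group_size] for i in range(0, len(packet), group_size)]


def packetize(groups: list[bytes], packet_size: int = 16) -> list[bytes]:
    """Reassemble groups from the I/O expansion bus into system bus packets."""
    data = b"".join(groups)
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


packet = bytes(range(16))             # one sixteen byte packet
groups = depacketize(packet)          # four groups of four bytes each
print(len(groups))                    # 4
print(packetize(groups) == [packet])  # True
```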
As noted above, the processors 12-1 to 12-n, the cache memories 13-1 to 13-n and the I/O bridge 17 must operate in a manner that maintains the consistency of the data in the main memory 18. This is complicated by the "write back" scheme employed in the computer system 10. For instance, suppose a first cache memory 13-1 modifies a copy of a data line of the main memory 18 but does not write the data line back to the main memory 18. If a second cache memory 13-2 issues a command to read the same data line, the second cache memory 13-2 should receive a copy of the modified data line in the first cache memory 13-1, not the stale copy stored in the main memory 18.
To this end, the devices of the computer system 10 implement an ownership protocol. Before a device may access particular data, the device must successfully "claim ownership" in the corresponding data line. A device which does not successfully claim ownership in a data line cannot access the data corresponding thereto.
Illustratively, the ownership protocol is implemented as follows. Suppose the I/O bridge 17 desires to access a particular data line. For instance, when the I/O device 14 desires to write data to the main memory 18, the I/O bridge 17 must claim ownership in the data lines stored in the destination addresses of the data to be written by the I/O device 14. (In fact, before an I/O bridge 17 can receive each datum to be written from the I/O device 14 to the main memory 18, the I/O bridge 17 must own the corresponding data line.) The I/O bridge 17 first issues a command for claiming ownership in the particular data line on the system bus 16. This ownership claiming command may simply be a command to read or write the particular data line. Each device that can own a data line monitors or "snoops" the system bus 16 for ownership claiming commands. After issuing the ownership claiming command, the I/O bridge 17 also monitors the system bus 16 for a specified period. If another device currently owns the data line for which the I/O bridge 17 issued the ownership claim, this device may issue a response as described below. If, during the specified period, the I/O bridge 17 does not detect a response from another device indicating that another device already owns the data line, the I/O bridge 17 successfully claims ownership of the data line.
Suppose that, at the time the I/O bridge 17 issues the ownership claiming command, a cache memory 13-2 already owns, but has not modified, the data line. Illustratively, the cache memory 13-2 detects the command issued by the I/O bridge 17. In response, the cache memory 13-2 illustratively concedes ownership of the data line to the I/O bridge 17. To that end, the cache memory 13-2 simply marks its copy of the cache line invalid. At a later time, if the cache memory 13-2 desires to access data corresponding to this data line, the cache memory 13-2 must first claim ownership in the data line and successfully obtain a fresh copy of the data line.
Alternatively, the cache memory 13-2 may mark the data line shared if the I/O bridge 17 indicates (from the ownership claim issued by the I/O bridge 17) that it does not desire to modify the data. Furthermore, the cache memory 13-2 issues a command to the I/O bridge 17 indicating that the data line is shared. Two or more devices can share ownership in a data line provided that none of the sharing devices has any intention of modifying the data line (that is, each sharing device wishes to read the data but not write the data). If one of the sharing devices later wishes to modify the data, that device issues an ownership claiming command which causes the other sharing devices to concede exclusive ownership to the device issuing the ownership claiming command.
Suppose that, at the time the I/O bridge 17 issues the ownership claim, the cache memory 13-2 already owns, has modified, but has not yet written back the data line in which the I/O bridge 17 attempts to claim ownership. In this case, the cache memory 13-2 first issues an intervention command on the system bus 16. The cache memory 13-2 then writes back its modified copy of the data line to the main memory 18.
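The three snoop responses described above (conceding ownership, sharing, and intervention) can be summarized in a small state sketch. The state names, the function `snoop` and the response strings are hypothetical labels chosen for illustration. In particular, the assumption that a modified line becomes invalid in the owning cache after the write back is not stated in the description.

```python
from enum import Enum
from typing import Optional


class LineState(Enum):
    INVALID = "invalid"
    SHARED = "shared"
    OWNED_CLEAN = "owned, unmodified"
    MODIFIED = "owned, modified, not yet written back"


def snoop(state: LineState, requester_will_modify: bool) -> tuple[LineState, Optional[str]]:
    """Return (new_state, response) for a cache that snoops an ownership claim."""
    if state is LineState.INVALID:
        return LineState.INVALID, None  # not an owner: no response
    if state is LineState.MODIFIED:
        # Issue an intervention command, then write the line back.
        # Assumption: the cache gives up the line after the write back.
        return LineState.INVALID, "intervention"
    if requester_will_modify:
        return LineState.INVALID, None  # concede ownership silently
    return LineState.SHARED, "shared"   # both devices may read the line


print(snoop(LineState.OWNED_CLEAN, requester_will_modify=True))
print(snoop(LineState.OWNED_CLEAN, requester_will_modify=False))
print(snoop(LineState.MODIFIED, requester_will_modify=True))
```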
In response to detecting the intervention command, the I/O bridge 17 can do one of a number of things. The I/O bridge 17 can reissue its ownership claiming command at a later time after the cache memory 13-2 has relinquished control of the data by writing the data back to the main memory 18. Alternatively, the I/O bridge 17 may utilize a "snarfing" process, described below, for receiving the data at the same time the data is written back to the main memory 18. These alternatives are illustrated in FIGS. 2 and 3.
FIG. 2 is a timing diagram showing various signals generated during a first alternative memory transfer scheme. In FIG. 2, during cycle one of the system clock SCLK, the I/O bridge 17 issues a command for claiming ownership in a data line. This command is detected by the cache memory 13-2 which issues, on cycle four, the signals CDM# and CAN# indicating that it already owns, has modified, but has not yet written back the data line in which the I/O bridge 17 attempted to claim ownership. (The main memory 18 also responds with the SLD# signal to indicate it received the command. However, this event is insignificant as the CDM# and CAN# signals cause the main memory 18 to abort transmitting data to the I/O bridge 17.) The cache memory 13-2 then issues a write command on cycle six and writes back the modified data line on cycles nine to twelve.
Meanwhile, in response to the CAN# signal, the I/O bridge 17 illustratively reissues its ownership claim on cycle six. The cache memory 13-2 detects this command and issues the CAN# signal on cycle nine to "negatively acknowledge" the command of the I/O bridge 17 (indicating that the command was not acknowledged). Subsequently, the cache memory 13-2 issues a write command on cycle eight and writes back the data to the main memory 18 via the data bus of the system bus 16 on cycles nine to twelve. Finally, on cycle eleven, the I/O bridge 17 successfully issues its ownership claiming command. Assuming the I/O bridge 17 issues a read command, the data is returned to the I/O bridge 17 via the data bus of the system bus 16 on cycles seventeen to twenty (not shown).
In the process illustrated in FIG. 2, the I/O bridge 17 must wait until after the cache memory 13-2 writes back the data to the main memory 18. Then, the I/O bridge 17 can successfully re-issue its ownership claiming command to claim ownership in the data, e.g., read the data from the main memory 18. This process is disadvantageous because many cycles are utilized to transfer ownership of the data line to the I/O bridge 17. Furthermore, the system bus 16 is utilized twice: once to transfer the modified data from the cache memory 13-2 to the main memory 18 and once to transfer the same data from the main memory 18 to the I/O bridge 17.
FIG. 3 illustrates an alternative transfer scheme called "memory reflection." As before, the I/O bridge 17 issues its ownership claim command on cycle one. Likewise, the cache memory 13-2 responds on cycle four to indicate that it already owns a modified copy of the data line in which the I/O bridge 17 has attempted to claim ownership. Furthermore, the cache memory 13-2 issues a write command on cycle six and writes back the modified cache line to the main memory 18 on cycles seven to ten. This is possible because the I/O bridge 17 does not re-issue its command for claiming ownership in the cache line on cycle six. Rather, the I/O bridge 17 enters a tracking mode in which the I/O bridge 17 monitors the command bus of the system bus 16 for the write command issued by the cache memory 13-2. Thus, on cycle six, the I/O bridge 17 can detect the cache memory's 13-2 command and address for writing back the data line in which the I/O bridge 17 unsuccessfully claimed ownership. When the cache memory 13-2 transfers the data to the main memory 18 on cycles seven to ten, the I/O bridge 17 "snarfs" or "eavesdrops" on the data intended for the main memory 18 from the data bus of the system bus 16. Thus, the I/O bridge 17 receives the data at the same time as the main memory 18. This obviates the need for the I/O bridge 17 to issue a subsequent read command to the main memory 18, resulting in substantial savings of time.
Stated more generally, the memory reflection scheme is utilized by a "write back agent", a "memory subsystem agent" and one or more "snarfing agents." A "write back agent" is a device, such as the cache memory 13-2, which writes back a modified data line. A "memory subsystem agent" is a device, such as the main memory 18, in which the integrity of the data must be maintained. A "snarfing agent" is a device, such as the I/O bridge 17, which attempts to claim ownership in the data line. When the write back agent writes back the data line to the memory subsystem agent, the snarfing agent snarfs the data. The memory reflection scheme requires approximately one-half the time of the first alternative process. Moreover, the memory reflection scheme utilizes only one data transfer on the system bus 16 to transfer data to two destinations contemporaneously.
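The roles of the three agents can be sketched as a toy bus model in which every attached device observes each data transfer. The sketch is a simplified illustration under stated assumptions: the class names, the line address 7 and the callback-style bus are hypothetical, and no cycle-level timing or signalling (such as CDM# or CAN#) is modeled.

```python
class Bus:
    """Toy data bus: every attached listener sees each transfer."""

    def __init__(self):
        self.listeners = []
        self.data_transfers = 0

    def transfer(self, line_address: int, data: bytes) -> None:
        self.data_transfers += 1
        for listener in self.listeners:
            listener(line_address, data)


class MainMemory:
    """Memory subsystem agent: stores every data line written back."""

    def __init__(self):
        self.lines = {}

    def on_transfer(self, line_address: int, data: bytes) -> None:
        self.lines[line_address] = data


class SnarfingAgent:
    """Device in tracking mode: captures only the line it tried to claim."""

    def __init__(self, tracked_line: int):
        self.tracked_line = tracked_line
        self.captured = None

    def on_transfer(self, line_address: int, data: bytes) -> None:
        if line_address == self.tracked_line:
            self.captured = data


bus = Bus()
memory = MainMemory()
bridge = SnarfingAgent(tracked_line=7)
bus.listeners = [memory.on_transfer, bridge.on_transfer]

# The write back agent puts the modified line on the bus once.
bus.transfer(7, b"modified line")
print(bus.data_transfers)                   # 1
print(memory.lines[7] == bridge.captured)   # True
```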
The devices 12-1 to 12-n, 13-1 to 13-n, 14, 15, 17 and 18 of computer system 10 are divided into two groups: masters and slaves. Devices which issue commands are called masters. Devices which respond to commands are called slaves. Typically, programmable devices such as processors or CPUs are masters. Cache memories, because they issue read and write commands to the main memory, are also classified as masters. An I/O bridge, because it too can issue read and write commands to the main memory, can be classified as a master. Other examples of masters are SCSI interfaces and hard disk controllers. The main memory 18 is typically classified as a slave device because it responds to the read and write commands issued by the master devices 12, 13, and 17. Other examples of slave devices are I/O devices such as printers, monitors and hard disk drives. In general, data is exchanged between the masters and slaves in response to the commands issued by the masters. For example, the master devices generate write commands for writing particular data into, and read commands for reading out particular data from, particular slave devices. There may be a plurality of both master devices and slave devices in the computer system 10.
As noted above, the CPUs 12-1 to 12-n can operate at a clock speed of up to 133 MHz. The system bus 16 typically operates at a clock speed of 33 MHz. The main memory 18, on the other hand, often operates at a clock speed of approximately 16.7 MHz. Because the main memory 18 is several times slower than the CPUs 12-1 to 12-n, the CPUs 12-1 to 12-n remain idle for many clock cycles while waiting for the main memory 18 to process previously received commands.
As noted above, the clock speeds of the masters far exceed the clock speed of the main memory 18 (e.g., 16.7 MHz). It is possible that several masters can issue data access commands in succession before the main memory 18 satisfies the first command. To accommodate such a scenario, and to further increase efficient use of the main memory 18, a memory controller 19 is employed as an intermediary between the main memory 18 and the system bus 16. As shown in FIG. 4, the memory controller 19 contains a set of three buffers: the command buffer 20, the write buffer 22, and the read buffer 23. The command buffer 20 typically has four command buffer slots 21-1 to 21-4. Each command buffer slot 21 contains either a read or write command that was issued by one of the masters, e.g., 12-1 to 12-n, 13-1 to 13-n, 17, etc. In addition, the command buffer slot 21 also contains a memory address indicating where the data is to be read from or written to. The data associated with a write command can be temporarily stored in a write buffer 22 before being written to the main memory 18. The read buffer 23, on the other hand, is for buffering the data transferred from the main memory 18 to the system bus 16.
FIG. 4 shows three write commands (W1, W2, and W3) being stored in the command buffer 20, and their corresponding data stored in the write buffer 22 (indicated as shaded slots). In the case depicted in FIG. 4, a read command, R, e.g., issued by CPU 12-1, has entered the command buffer 20 after three previously issued write commands W1, W2 and W3. In FIG. 4, the memory controller 19 uses a simple "first-in first-out" (FIFO) or "strictly in-order" scheme to process read and write commands R, W1, W2 and W3. As the name implies, the read and write commands R, W1, W2, W3 are handled in the order in which they are received by the memory controller 19. Unfortunately, in this example, the CPU 12-1 that issued the read command R cannot be immediately updated because the read command is "blocked", i.e., delayed, by the previous three write commands W1, W2 and W3.
The blocking of read commands has a negative impact on CPU utilization. Under normal conditions, after a CPU, e.g., the CPU 12-1, issues a write command and sends the data to the memory controller 19, the CPU 12-1 can continue processing subsequent jobs. However, due to data dependencies, after a read command is issued, the issuing CPU 12-1 may have to remain idle until the requested data is returned from the main memory 18. Consequently, the response time of read operations is much more important to the overall performance of the computer system 10 than the response time of write operations. In the case depicted in FIG. 4, the response time for the read command is four times longer than the actual memory read time (assuming that read and write operations require the same memory access time). This problem is exacerbated when multiple CPUs share the main memory 18. In particular, consider that the multiple processors 12-1 to 12-n may be simultaneously executing program instructions. With multiple processors executing program instructions, the number of read and write commands issued to the memory controller 19 will increase linearly. Hence the average number of operations queued in the command buffer 20 may grow accordingly. The result is that a newly arriving read command must wait in the memory controller command buffer 20 for a longer period of time.
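The fourfold response time in the FIG. 4 case follows directly from the strictly in-order scheme, as the short sketch below illustrates. It assumes, as stated above, that read and write operations take the same memory access time (one unit here); the function name is hypothetical.

```python
def fifo_read_response_time(queue: list[str], access_time: int = 1) -> int:
    """Time until the first read command completes under strict FIFO order.

    Commands are named as in FIG. 4: "W1", "W2", "W3" for writes, "R" for
    the read. Every command is assumed to take the same access time.
    """
    elapsed = 0
    for command in queue:
        elapsed += access_time
        if command.startswith("R"):
            return elapsed
    raise ValueError("no read command in the queue")


# The read waits behind three writes, then takes one access time itself.
print(fifo_read_response_time(["W1", "W2", "W3", "R"]))  # 4
```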
There have been a variety of schemes to reduce the "memory read response time" (MRRT) of the main memory 18. U.S. Pat. No. 5,333,276 teaches an I/O bridge which decreases the memory response time for read and write commands that cross both the system bus 16 and I/O expansion bus 16' (i.e., read and write commands issued by an I/O device 14 to the main memory 18 or from a CPU 12-1 to 12-n to an I/O device 14). In particular, this reference teaches a computer system with a single master, namely, the CPU or processor. The master performs read and write operations on multiple slaves (e.g., I/O modules, main memory, etc.). The technique taught in this reference is not well suited to computer systems with multiple masters, especially those that utilize snarfing. Furthermore, when multiple slaves are present, the technique taught in this reference requires a large and very complex state diagram which is difficult to implement.
Another prior art method for reducing MRRT is the "read-around-write" scheme. Under the read-around-write scheme, a read command has a higher priority in the command buffer 20 and can bypass previous write commands. The read-around-write scheme is shown in FIG. 5. FIG. 5 shows that, although a read command R enters the buffer later than three previously received write commands, W1, W2, and W3, the read command can bypass (dashed line) the write commands W1, W2, and W3 and be the first command executed. The MRRT in the read-around-write case in FIG. 5 is the same as the true memory access time that a read command requires.
Although the read-around-write scheme works well in the previously mentioned scenario, there are other scenarios where this scheme does not yield the minimum MRRT. Such a scenario is shown in FIG. 6, where a read command R follows three write commands W1, W2 and W3. In this case, the read command R accesses the same addressed memory location as one of the write commands, W2. As such, the read command R advances in execution priority before the write command W3 (dashed line). However, the execution of the read command R is delayed until the write command W2 is executed, in order to ensure data consistency. Because the W2 command is executed only after the W1 command, the read command R is in turn delayed until both W1 and W2 have been executed. In this case, the MRRT is three times longer than the true memory access time that a read command requires.
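The two read-around-write cases of FIGS. 5 and 6 can be reproduced with a small ordering sketch. This is an illustration under stated assumptions and not the controller's actual logic: the queue holds a single read command, all commands take one unit of access time, writes stay in their original order, and the memory addresses shown are hypothetical values chosen only to create or avoid the address match.

```python
from typing import NamedTuple


class Command(NamedTuple):
    name: str
    kind: str  # "R" for read, "W" for write
    address: int


def read_around_write_order(queue: list[Command]) -> list[Command]:
    """Execution order when the (single) read may bypass earlier writes.

    The read moves ahead of every earlier write except those up to and
    including the last earlier write to the same address.
    """
    read_index = next(i for i, c in enumerate(queue) if c.kind == "R")
    read = queue[read_index]
    barrier = 0  # index just past the last conflicting write (0 if none)
    for i in range(read_index):
        if queue[i].address == read.address:
            barrier = i + 1
    return (queue[:barrier] + [read]
            + queue[barrier:read_index] + queue[read_index + 1:])


def read_response_time(queue: list[Command], access_time: int = 1) -> int:
    order = read_around_write_order(queue)
    position = next(i for i, c in enumerate(order) if c.kind == "R")
    return (position + 1) * access_time


writes = [Command("W1", "W", 0x100), Command("W2", "W", 0x200),
          Command("W3", "W", 0x300)]
fig5 = writes + [Command("R", "R", 0x400)]  # no address match
fig6 = writes + [Command("R", "R", 0x200)]  # same address as W2

print([c.name for c in read_around_write_order(fig5)], read_response_time(fig5))
# ['R', 'W1', 'W2', 'W3'] 1
print([c.name for c in read_around_write_order(fig6)], read_response_time(fig6))
# ['W1', 'W2', 'R', 'W3'] 3
```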
It is, therefore, an object of the present invention to present a memory controller scheme with improved MRRT.