The present invention relates to a method and an apparatus for sharing memory among coherence domains of computer systems. More specifically, the invention relates to a novel method and apparatus for efficiently solving coherence problems when memory blocks having local physical addresses (LPA) in a particular computer node of a computer system are shared by other nodes of the system as well as by external entities coupled to that computer node.
The sharing of memory among multiple coherence domains presents unique coherence problems. To facilitate a discussion of these coherence problems, FIG. 1 shows a computer node 100 representing, e.g., a computer node in a more complex computer system. Within computer node 100, there are shown a plurality of processing nodes 102, 104, and 106 coupled to a common bus 108. Each of processing nodes 102, 104, and 106 represents, for example, a discrete processing unit that may include, e.g., a processor and its own memory cache. The number of processing nodes provided per computer node 100 may vary depending on needs, and may include any arbitrary number although only three are shown herein for simplicity of illustration.
Within computer node 100, common bus 108 is shown coupled to a memory module 110, which represents the memory space of computer node 100 and may be implemented using a conventional type of memory such as dynamic random access memory (DRAM). Memory module 110 is typically organized into a plurality of uniquely addressable memory blocks 112. Each memory block of memory module 110, e.g., memory block 112(a) or memory block 112(b), has a local physical address (LPA) within computer node 100, i.e., its unique address maps into the memory space of computer node 100. Each memory block 112 represents a storage unit for storing data, and each may be shared among processing nodes 102, 104, and 106 via common bus 108. Of course, there may be provided as many memory blocks as desired to satisfy the storage needs of computer node 100.
As is known to those skilled in the art, computer processors, e.g., processor 116 within processing node 102, typically operate at a faster speed than the speed of the memory module 110. To expedite access to the memory blocks 112 of memory module 110, there is usually provided with each processing node, e.g., processing node 102, a memory cache 114. A memory cache, e.g., memory cache 114, takes advantage of the fact that a processor, e.g., processor 116, is more likely to reference memory addresses that it recently referenced than other random memory locations. Further, memory cache 114 typically employs faster memory and tends to be small, which further contributes to speedy operation.
Within memory cache 114, there exists a plurality of block frames 118 for storing copies of memory blocks, e.g., memory blocks 112. Each block frame 118 has an address portion 120 for storing the address of the memory block it cached. If the unique address of memory block 112(a) is, e.g., FF5h, this address would be stored in address portion 120 of a block frame 118 when memory block 112(a) of memory module 110 is cached into memory cache 114. There is also provided in block frame 118 a data portion 122 for storing the data value of the cached memory block. For example, if the value stored in memory block 112(a) was 12 when memory block 112(a) was cached into block frame 118, this value 12 would be stored in data portion 122 of block frame 118.
Also provided in block frame 118 is a status tag 124 for storing the state of the memory block it cached. Examples of such states are, e.g., gM, gS, and gI, representing respectively global exclusive, global shared, and global invalid. The meanings of these states are discussed in greater detail herein, e.g., with reference to FIG. 4.
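By way of illustration only, the three portions of a block frame described above may be sketched as a simple record. The names `State` and `BlockFrame` are hypothetical and do not appear in the figures, and the choice of the gS state for the example frame is an assumption, since the state of memory block 112(a) at the time of caching is not specified above.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """Possible values of status tag 124 (names are illustrative)."""
    GM = "gM"  # global exclusive
    GS = "gS"  # global shared
    GI = "gI"  # global invalid


@dataclass
class BlockFrame:
    """Hypothetical model of one block frame 118 in memory cache 114."""
    address: int   # address portion 120: address of the cached memory block
    data: int      # data portion 122: value of the cached memory block
    state: State   # status tag 124: state of the cached memory block


# Memory block 112(a) at address FF5h, holding the value 12 when cached.
# The gS state is assumed here purely for the sake of the example.
frame = BlockFrame(address=0xFF5, data=12, state=State.GS)
print(hex(frame.address), frame.data, frame.state.value)
```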
A processing node may hold an exclusive copy of a memory block in its cache when it is the only entity having a valid copy. Such exclusive copy may potentially be different from its counterpart in memory module 110, e.g., it may have been modified by the processing node that cached it. Alternatively, a processing node may possess a shared, read-only copy of a memory block. When one processing node, e.g., processing node 102, caches a shared copy of a memory block, e.g., memory block 112(a), other processing nodes, e.g., processing nodes 104 and 106, may also possess shared copies of the same memory block.
If a memory block is never cached in a processing node or it was once cached but is no longer cached therein, that processing node is said to have an invalid copy of the memory block. No valid data is contained in the block frame when the state associated with that block frame is invalid.
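The relationship among the exclusive, shared, and invalid states described above may be expressed, as a minimal sketch, by a check over the states that the processing nodes hold for one memory block. The function name `copies_are_consistent` and the use of the strings "gM", "gS", and "gI" as state labels are assumptions made for this example.

```python
def copies_are_consistent(states):
    """Return True if the per-node states for ONE memory block are compatible.

    `states` holds one label per processing node: "gM" (exclusive),
    "gS" (shared), or "gI" (invalid). An exclusive holder must be the
    only entity with a valid copy; shared copies may coexist freely.
    """
    exclusive = states.count("gM")
    shared = states.count("gS")
    if exclusive > 1:
        return False  # two nodes cannot both be the only valid holder
    if exclusive == 1 and shared > 0:
        return False  # an exclusive copy rules out any other valid copy
    return True


# Node 102 exclusive, nodes 104 and 106 invalid: permitted.
print(copies_are_consistent(["gM", "gI", "gI"]))  # True
# Nodes 102 and 104 shared, node 106 invalid: permitted.
print(copies_are_consistent(["gS", "gS", "gI"]))  # True
# Node 102 exclusive while node 104 holds a shared copy: not permitted.
print(copies_are_consistent(["gM", "gS", "gI"]))  # False
```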
The coherence problem that may arise when memory block 112(a) is shared among the processing nodes of FIG. 1 will now be discussed in detail. Assume that processing node 102 caches a copy of memory block 112(a) into its memory cache 114 in order to change the value stored in memory block 112(a) from 12 to 13. Typically, when the value is changed by a processing node such as processing node 102, that value is not updated back into memory module 110 immediately. Rather, the updating is typically performed when memory cache 114 of processing node 102 writes back the copy of memory block 112(a) it had earlier cached.
Now suppose that before memory cache 114 has a chance to write back the changed value of memory block 112(a), i.e., 13, into memory module 110, processing node 104 wishes to reference memory block 112(a). Processing node 104 would first check its own memory cache 132 to determine whether a copy of memory block 112(a) had been cached therein earlier. Assuming that a copy of memory block 112(a) has never been cached by processing node 104, a cache miss would occur.
Upon experiencing the cache miss, processing node 104 may then proceed to obtain a copy of memory block 112(a) from memory module 110. Since the changed value of memory block 112(a) has not been written back into memory module 110 by processing node 102, the old value stored in memory block 112(a), i.e., 12, would be acquired by processing node 104. This problem is referred to herein as the coherence problem and has the potential to provide erroneous values to processing nodes and other devices that share a common memory.
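The sequence of events just described may be reproduced with a deliberately naive model in which each processing node has a private write-back cache and no coherence mechanism is provided. The class `ProcessingNode`, its methods, and the dictionary `memory` are hypothetical constructs introduced only for this sketch; the address FF5h and the values 12 and 13 are taken from the example above.

```python
# Memory module 110, reduced to a single block 112(a) at address FF5h.
memory = {0xFF5: 12}


class ProcessingNode:
    """Hypothetical node with a private write-back cache and NO coherence."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # address -> cached value

    def read(self, addr):
        # On a cache miss, fetch the block from the memory module.
        if addr not in self.cache:
            self.cache[addr] = memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        # Cache the block if necessary, then modify only the cached copy.
        self.read(addr)
        self.cache[addr] = value

    def write_back(self, addr):
        # Only at this point does the memory module see the new value.
        memory[addr] = self.cache[addr]


node_102 = ProcessingNode("102")
node_104 = ProcessingNode("104")

node_102.write(0xFF5, 13)        # node 102 changes 12 to 13 in its cache
stale = node_104.read(0xFF5)     # node 104 misses and reads memory module 110

print(node_102.cache[0xFF5])     # 13: the up-to-date value
print(stale)                     # 12: the old value acquired by node 104
```

Because nothing in this model informs processing node 104 that processing node 102 holds a modified copy, the read returns the old value, which is the coherence problem in miniature.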
Up to now, the sharing of memory blocks 112 has been illustrated only with reference to devices internal to computer node 100, i.e., devices such as processing nodes 102, 104, and 106 that are designed to be coupled to common bus 108 and to communicate thereon employing the same communication protocol. There may be times when it is necessary to couple computer node 100 to other external devices, e.g., to facilitate the expansion of the computer system. Oftentimes, the external devices may employ a different protocol from that employed on common bus 108 of computer node 100 and may even operate at a different speed.
External device 140 of FIG. 1 represents such an external device. For discussion purposes, external device 140 may represent, for example, an input/output (I/O) device such as a gateway to a network. Alternatively, external device 140 may be, for example, a processor such as a Pentium Pro™ microprocessor (available from Intel Corp. of Santa Clara, Calif.), representing a processor whose protocol and operating speed may differ from those on common bus 108. As a further example, external device 140 may represent a distributed shared memory agent for coupling computer node 100 to other entities having their own memory spaces, e.g., other computer nodes having their own memory modules. Via the distributed shared memory agent, the memory blocks within computer node 100 as well as within those other memory-space-containing entities may be shared.
Although an external device may need to share the data stored in memory module 110, it is typically not possible to couple an external device, such as external device 140, directly to common bus 108 to allow external device 140 to share the memory blocks in memory module 110. The direct coupling is not possible due to, among other reasons, the aforementioned differences in protocols and operating speeds.
In view of the foregoing, what is needed is an improved method and apparatus for permitting memory blocks having a local physical address (LPA) in a particular computer node to be shared, in an efficient and error-free manner, among interconnected entities such as other processing nodes and external devices.