This invention relates generally to multiprocessor systems that are comprised of a number of separate but interconnected processor nodes. More particularly, this invention relates to a method for reducing the time required to invalidate shared cache lines that reside on the separate nodes when a copy of the data in one of the lines is changed.
Multiprocessor computer systems by definition contain multiple processors that can execute multiple parts of a computer program or multiple distinct programs simultaneously, in a manner known as parallel computing. In general, multiprocessor systems execute multithreaded-programs or single-threaded programs faster than conventional single processor computers, such as personal computers (PCs), that must execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded-program and/or multiple distinct programs can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
Multiprocessor systems may be classified by how they share information among the processors. Shared memory multiprocessors offer a common physical memory address space that all processors can access. Multiple processes or multiple threads within the same process can communicate through shared variables in memory that allow them to read or write to the same memory location in the computer. Message passing multiprocessors, in contrast, have a separate memory space for each processor, requiring processes to communicate through explicit messages to each other.
Shared memory multiprocessors can further be classified by how their memory is physically organized. In distributed shared memory (DSM) machines, the memory is divided into modules physically placed near groups of one or more processors. Although all of the memory modules are globally accessible, a processor can access memory placed nearby faster than memory placed remotely. Because the memory access time differs based on memory location, distributed shared memory machines are also called non-uniform memory access (NUMA) machines. In centralized shared memory machines, on the other hand, all of the memory is physically in one location. Centralized shared memory machines are also called uniform memory access (UMA) machines because the memory is equidistant in time from each of the processors. Both forms of memory organization typically use high-speed cache memory in conjunction with main memory to reduce execution time.
Multiprocessor systems with distributed shared memory are often organized into nodes with one or more processors per node. Included in the node may be a node bus, local memory for the processors, a remote cache for caching data obtained from memory in other nodes, and logic for linking the node with other nodes in the computer. A processor in a node communicates directly with the local memory and communicates indirectly with memory on other nodes through the node""s remote cache. For example, if the desired data is in local memory, a processor obtains the data directly from a block (or line) of local memory. However, if the desired data is stored in memory in another node, the processor must access its remote cache to obtain the data. A remote cache hit occurs if the data has been obtained recently and is presently stored in a line of the remote cache. Otherwise a cache miss occurs, and the processor must obtain the desired data from the local memory of another node through the linking logic and place the obtained data in its node""s remote cache. Further information on multiprocessor computer systems in general and NUMA machines in particular can be found in a number of works including Computer Architecture: A Quantitative Approach (2nd Ed. 1996), by D. Patterson and J. Hennessy.
Data coherency is maintained among the multiple caches and memories of a distributed shared memory machine through a cache coherency protocol such as the protocol described in the Scalable Coherent Interface (SCI)(IEEE 1596). Central to this coherency protocol is the use of doubly linked sharing list structures to keep track of the cache lines of separate remote caches that share the same data. Before the data in one of the linked cache lines can change, such as by a processor writing to a cache line, the other cache lines on the list must be invalidated and the list purged (i.e., dissolved).
An SCI sharing list is constructed using tags that are associated with each line of memory and each line of a remote cache. A memory tag includes a state field and a head pointer that, when a sharing list exists for the memory line, points to the node that is the head of the list. A remote cache tag includes a state field, a backward pointer, and a forward pointer. The backward pointer points to the node that is the next list element toward the home node. The forward pointer points to the node that is the next list element toward the list tail. In the remote cache tag at the list head, the backward pointer points to the home memory whose data is cached. In the remote cache tag at the list tail, the forward pointer is null.
A sharing list is formed, increased, or dissolved whenever a processor tries to read from or write to a line of data that is not present in its remote cache or local memory. In these cases a processor then requests the data from the remote memory storing the data. If no cached copies of the line exist in the computer system, the remote memory responds with the data. A sharing list is then formed with a cache line on the requesting processor""s node now storing the data. The pointers in the memory and remote cache tags are changed as described above to designate the node containing the cache line as the list head, with the remote cache tag""s forward pointer set to null since there are no other list elements. If a cached copy of the data already exists in the computer system, the memory still responds with the data if it is valid; otherwise, the data is obtained from the present head of the list. Again, the pointers in the memory and remote cache tags are changed to designate the node of the requesting processor as the list head and the old head is moved down the list.
An old sharing list is dissolved and its cache lines invalidated when a processor writes to a memory line that corresponds to the sharing list. The SCI scheme for invalidating a sharing list is shown in FIGS. 1A-C, where a processor writes to a cache line that is shared by other nodes. In FIG. 1A, the state of a sharing list is shown before the scheme is initiated. Node N is the head of the list. As indicated by the bidirectional arrows, the remote cache tag for the shared line points forward to node Y, whose remote cache tag for the line points backward to node N. Similarly, the remote cache tag for the cache line on node Y points forward to node Z, whose remote cache tag for the share line points backward to node Y. Since the remote cache tag for the line on node Z does not point forward to another cache line in this example, node Z is the tail of the list. In FIG. 1B, node N issues an SCI invalidate request (1) to node Y to remove its cache line from the list. Node Y responds through its cache controller by issuing a bus invalidate request (2) on its node bus to its bus controller. The bus controller invalidates the copies of the cache line that exist in the processors"" local caches and acknowledges to the cache controller when the invalidate is complete (3). The cache controller then updates the state of its cache line in the remote cache to invalid and issues an invalidate response to node N (4). This response confirms that the cache line has been invalidated and that node Y has been removed from the list. The response also includes the forward pointer to node Z. Using this forward pointer, node N then issues an SCI invalidate request to node Z to remove its cache line from the list (5). Node Z responds in the same manner as node Y. by issuing a bus invalidate request (6) and a bus invalidate response (7) internally and an SCI invalidate response to Node N (8). FIG. 1C shows the state of the nodes"" cache lines after the sequence of invalidate requests is complete and the sharing list has been dissolved. The state of the cache line on node N indicates that only its cache line now has valid data (even the memory data is invalid). The states of the cache lines on nodes Y and Z indicate that their copies of the data are invalid.
Although this scheme works, it is slow for several reasons. First, the invalidate request requires multiple responses to the head of the sharing list, one from each element node of the list. Second, each element node must complete the invalidation of the copies of the cache lines stored in the processors"" local caches before acknowledging that the invalidation is complete. The problem of multiple responses may be addressed by forwarding the invalidate request from one node to another rather than responding to the list head. The serial nature of the invalidation scheme, however, persists. An element node does not respond to the head of the list or forward the invalidate request to a succeeding node until the element node invalidates all copies of the cache line within the local caches and remote cache.
An objective of the invention, therefore, is to reduce the time for invalidating cache lines on a sharing list by performing some of the requisite actions in parallel. Another objective of the invention is to reduce such time by having a node acknowledge that the invalidate request is complete before all copies of the cache line thereon have actually been invalidated. Another objective of the invention is to reduce such time while maintaining the appropriate processor consistency model.
A method for invalidating cache lines in a multiprocessor system includes, in response to an invalidate request from a first node, issuing an invalidate acknowledgement by a second node before the cache line on the second node is actually invalidated. The early invalidate acknowledgement can be issued by sending it to the first node or forwarding it to a third node, as desired.
The method may further include, after issuing the invalidate acknowledgement, setting a flag indicating that an invalidation is pending and, after completing the invalidation of the cache line, clearing the flag. This is done to maintain processor consistency.
The method may also include receiving at the second node a cacheable transaction. The flag is then checked to determine whether an invalidation is pending at the second node. If so, execution of the cacheable transaction is delayed until the flag is clear. The cacheable transaction can be any transaction that affects processor consistency, such as issuance of a read response to a node bus, an issuance of a write request to a node bus, or an issuance of an interrupt.
Similarly, a multiprocessor system capable of executing the above methods is shown and described.
The method is particularly applicable to multiprocessor systems using the SCI protocol, in which shared cache lines are connected via a sharing list.
These and other aspects and features of the invention will become apparent from the following detailed description and accompanying drawings.