Multiprocessor computers by definition contain multiple processors that can execute multiple parts of a computer program or multiple distinct programs simultaneously, in a manner known as parallel computing. In general, multiprocessor computers execute multithreaded programs or single-threaded programs faster than conventional single-processor computers, such as personal computers (PCs), that must execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded program and/or multiple distinct programs can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
Multiprocessor computers may be classified by how they share information among the processors. Shared memory multiprocessor computers offer a common physical memory address space that all processors can access. Multiple processes or multiple threads within the same process can communicate through shared variables in memory that allow them to read or write to the same memory location in the computer. Message passing multiprocessor computers, in contrast, have a separate memory space for each processor, requiring processes in such a system to communicate through explicit messages to each other.
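The two communication styles can be contrasted with a minimal sketch. It is an illustration only: Python threads stand in for processors, a lock-protected variable stands in for a shared memory location, and a queue stands in for explicit messages. The function names and the worker count are assumptions made for this example and do not come from any particular machine.

```python
import queue
import threading


def shared_memory_sum(values, workers=4):
    """Workers communicate by writing to one shared memory location."""
    total = {"value": 0}      # the shared variable every worker can access
    lock = threading.Lock()   # serializes updates to the shared location

    def work(chunk):
        partial = sum(chunk)
        with lock:
            total["value"] += partial

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(values, workers=4):
    """Workers keep private results and communicate only by explicit messages."""
    mailbox = queue.Queue()   # stands in for the interconnect between processors

    def work(chunk):
        mailbox.put(sum(chunk))   # send one message carrying the partial sum

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The receiver collects exactly one message per worker.
    return sum(mailbox.get() for _ in range(workers))
```

Both functions compute the same result; the difference lies only in whether the partial sums meet in a common variable or travel as messages.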
Shared memory multiprocessor computers may further be classified by how the memory is physically organized. In distributed shared memory (DSM) machines, the memory is divided into modules physically placed near each processor. Although all of the memory modules are globally accessible, a processor can access memory placed nearby faster than memory placed remotely. Because the memory access time differs based on memory location, distributed shared memory systems are also called non-uniform memory access (NUMA) machines. In centralized shared memory computers, on the other hand, the memory is physically in one location. Centralized shared memory computers are called uniform memory access (UMA) machines because the memory is equidistant in time from each of the processors. Both forms of memory organization typically use high-speed cache memory in conjunction with main memory to reduce execution time.
Multiprocessor computers with distributed shared memory are organized into nodes with one or more processors per node. Also included in the node are local memory for the processors, a remote cache for caching data obtained from memory in other nodes, and logic for linking the node with other nodes in the computer. A processor in a node communicates directly with the local memory and communicates indirectly with memory on other nodes through the node's remote cache. For example, if the desired data is in local memory, a processor obtains the data directly from a block (or line) of local memory. But if the desired data is stored in memory in another node, the processor must access its remote cache to obtain the data. A cache hit occurs if the data has been obtained recently and is presently stored in a line of the remote cache. Otherwise a cache miss occurs, and the processor must obtain the desired data from the local memory of another node through the linking logic and place the obtained data in its node's remote cache.
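The access path just described can be sketched as follows. This is a simplified model under stated assumptions: the class name Node, the home dictionary that maps an address to the node holding it, and the hit and miss counters are all hypothetical, and coherency is ignored entirely.

```python
class Node:
    """One node of a distributed shared memory machine (simplified model)."""

    def __init__(self, node_id, local_memory):
        self.node_id = node_id
        self.local_memory = dict(local_memory)  # address -> data held by this node
        self.remote_cache = {}                  # address -> data copied from other nodes
        self.hits = 0
        self.misses = 0

    def read(self, address, home, nodes):
        """Return the data at address; home maps each address to its node id."""
        owner = home[address]
        if owner == self.node_id:
            # Local data is read directly from local memory.
            return self.local_memory[address]
        if address in self.remote_cache:
            # Cache hit: the line was obtained earlier and is still cached.
            self.hits += 1
            return self.remote_cache[address]
        # Cache miss: fetch from the other node's local memory (this lookup
        # stands in for the linking logic) and keep a copy in the remote cache.
        self.misses += 1
        data = nodes[owner].local_memory[address]
        self.remote_cache[address] = data
        return data
```

In this model a first read of a remote address counts as a miss and a repeated read of the same address counts as a hit, which mirrors the hit and miss cases described above.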
Further information on multiprocessor computer systems in general and NUMA machines in particular can be found in a number of works including Computer Architecture: A Quantitative Approach (2nd Ed. 1996), by D. Patterson and J. Hennessy, which is incorporated by reference.
Data coherency is maintained among the multiple caches and memories of a distributed shared memory machine through a cache coherency protocol such as the protocol described in the Scalable Coherent Interface (SCI) (IEEE 1596). Central to the coherency protocol is the use of doubly linked sharing list structures to keep track of the cache lines from separate remote caches that share the same data. When the data in one of the linked cache lines changes, such as by a processor writing to the line, the other cache lines on the list are determined and then invalidated, and the list is purged (i.e., dissolved).
An SCI sharing list is constructed using tags that are associated with each line of memory and each line of a remote cache. The memory tag includes a state field and a head pointer that, when a sharing list exists for the memory line, points to the node that is the head of the list. The cache tag includes a state field, a backward pointer to the next list element toward the memory line and a forward pointer to the next list element toward the tail of the list. If the node is the head of the list, the backward pointer of the cache line points to the memory line whose data it is caching.
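The tag layout can be pictured with a small data-structure sketch. The field names follow the description above, but the state strings, the MEMORY sentinel, and the walk_list helper are assumptions chosen for illustration; they are not the encodings defined by the SCI standard.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MEMORY = "MEMORY"  # hypothetical sentinel: the backward pointer of the head


@dataclass
class MemoryTag:
    state: str = "NO_LIST"          # illustrative state name
    head: Optional[str] = None      # node at the head of the sharing list


@dataclass
class CacheTag:
    state: str = "INVALID"          # illustrative state name
    backward: Optional[str] = None  # next element toward the memory line
    forward: Optional[str] = None   # next element toward the tail


def walk_list(memory_tag: MemoryTag, cache_tags: Dict[str, CacheTag]) -> List[str]:
    """Return the node ids on the sharing list, ordered from head to tail."""
    order = []
    node = memory_tag.head
    while node is not None:
        order.append(node)
        node = cache_tags[node].forward
    return order
```

A list is traversed by starting at the head pointer in the memory tag and following forward pointers until one is null, which marks the tail.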
A sharing list is formed or increased whenever a processor tries to read from or write to a line of data that is not present in its remote cache or local memory. In these cases a processor will request the data from the remote memory storing the data. If no cached copies of the line exist in the computer system, then memory responds with the data. A sharing list is formed with a cache line on the requesting processor's node now storing the data. The pointers in the memory and cache line tags are changed to designate the node containing the cache line as the head of the list, with the cache line's forward pointer set to null since there are no other list elements. If a cached copy of the data already exists in the computer system, the memory still responds with the data if it is valid; otherwise, the data is obtained from the present head of the list. Again, the pointers in the memory and cache line tags are then changed to designate the node reading or writing the data as the head of the list.
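The pointer updates performed when a node joins a sharing list can be sketched as below. The sketch models only the tags, not the transfer of data or the choice between memory and the old head as the data source. The dictionary layout, the state strings, and the function name attach_as_head are hypothetical.

```python
MEMORY = "MEMORY"  # hypothetical sentinel for "points back to the memory line"


def attach_as_head(memory_tag, cache_tags, node):
    """Make node the new head of the sharing list for one memory line.

    memory_tag is a dict with "state" and "head" keys; cache_tags maps a
    node id to a dict with "state", "backward", and "forward" keys.
    """
    old_head = memory_tag["head"]
    # The new head points backward to memory and forward to the old head.
    # With no prior list, old_head is None and the forward pointer is null.
    cache_tags[node] = {"state": "VALID", "backward": MEMORY, "forward": old_head}
    if old_head is not None:
        # The previous head now points backward to the new head.
        cache_tags[old_head]["backward"] = node
    memory_tag["head"] = node
    memory_tag["state"] = "SHARED"


# Nodes Z, Y, and N request the same line in that order.
memory_tag = {"state": "NO_LIST", "head": None}
cache_tags = {}
for requester in ["Z", "Y", "N"]:
    attach_as_head(memory_tag, cache_tags, requester)
```

Because each requester is attached at the head, the most recent requester (N here) ends up at the head and the earliest (Z) at the tail.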
When a processor writes to a memory line that points to a sharing list, the list must be invalidated since the other cache lines on the list no longer have copies of the most current data. The SCI scheme for invalidating a sharing list is shown in FIGS. 1A-C, where a processor has written to the cache line whose node is at the head of the list. (If a processor attempts to write to a cache line whose node is not the head of a list, that node is first made the head of the list.) In FIG. 1A, the state of a sharing list is shown before the scheme is initiated. Node N is the head of the list. As indicated by the bidirectional arrows, its cache line points forward to node Y, whose cache line points backward to node N. Similarly, the cache line on node Y points forward to node Z, whose cache line points backward to node Y. Since the cache line on node Z does not point forward to another cache line in this example, it is the tail of the list. In FIG. 1B, node N issues an SCI invalidate request to node Y to remove its cache line from the list. Node Y responds by changing the state of its cache line to indicate its data is invalid and by issuing an invalidate response to node N. This response confirms that the cache line has been invalidated and that node Y has been removed from the list. The response also includes the forward pointer to node Z. Using this forward pointer, node N then issues an SCI invalidate request to node Z to remove its cache line from the list. Node Z responds by changing the state of its cache line to indicate its data is invalid and by issuing an invalidate response to node N. FIG. 1C shows the state of the nodes' cache lines after the sequence of invalidate requests is complete and the sharing list has been dissolved. The state of the cache line on node N indicates that only its cache line now has valid data (even the memory data is invalid). The states of the cache lines on nodes Y and Z indicate that their copies of the data are invalid.
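The request and response sequence of FIGS. 1A-C can be traced with a short simulation. It is a sketch under stated assumptions: messages are recorded as tuples rather than sent over an interconnect, and the state strings and function names are hypothetical rather than taken from the SCI standard.

```python
def handle_invalidate(tag):
    """Target side: mark the line invalid and return its old forward pointer."""
    forward = tag["forward"]
    tag.update(state="INVALID", backward=None, forward=None)
    return forward


def purge_sharing_list(cache_tags, head):
    """Head side: invalidate every other list element, one at a time.

    Returns the ordered log of (message kind, sender, receiver) tuples.
    """
    log = []
    target = cache_tags[head]["forward"]
    while target is not None:
        log.append(("invalidate_request", head, target))
        # The response carries the target's forward pointer, which the head
        # needs before it can address the next element.
        next_target = handle_invalidate(cache_tags[target])
        log.append(("invalidate_response", target, head))
        target = next_target
    cache_tags[head].update(state="ONLY_VALID", forward=None)
    return log


# The list of FIG. 1A: N is the head, Y is in the middle, Z is the tail.
cache_tags = {
    "N": {"state": "VALID", "backward": "MEMORY", "forward": "Y"},
    "Y": {"state": "VALID", "backward": "N", "forward": "Z"},
    "Z": {"state": "VALID", "backward": "Y", "forward": None},
}
messages = purge_sharing_list(cache_tags, "N")
```

For this three-node list the log holds four messages, a request and a response for each of nodes Y and Z, and each request is issued only after the previous response has been received.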
Although this scheme works, it is relatively slow. Each element of the sharing list must be sent its own invalidate request and must return a response to the head, and the head cannot issue the next request until that response supplies the forward pointer to the next element. Eliminating these multiple responses would accelerate the purging of sharing lists and thereby improve the overall performance of computer systems running protocols such as the SCI cache coherence protocol.
An objective of the invention, therefore, is to accelerate the communication of requests to a list of elements such as lists created under the SCI protocol.