The present invention relates generally to high-performance parallel multi-processor computer systems and more particularly to a speculative recall and/or forwarding method to accelerate overall data transfer between processor caches in cache-coherent multi-processor systems.
Many high-performance parallel multi-processor computer systems are built as a number of nodes interconnected by a general interconnection network (e.g., a crossbar or hypercube), where each node contains a subset of the processors and memory in the system. While the memory in the system is distributed, several of these systems (called NUMA systems for Non-Uniform Memory Architecture) support a shared memory abstraction where all the memory in the system appears as a large memory common to all processors in the system. To support high performance, these systems typically allow processors in various nodes to maintain copies of memory data in their local caches. Since multiple processors can cache the same data, these systems must incorporate a cache coherence mechanism to keep the copies consistent, or coherent. These cache-coherent systems are referred to as ccNUMA systems and examples are DASH and FLASH from Stanford University, ORIGIN from Silicon Graphics, STING from Sequent Computers, and NUMAL from Data General.
Coherence is maintained in ccNUMA systems using a directory-based coherence protocol. With coherence implemented in hardware, special hardware coherence controllers maintain the coherence directory and execute the coherence protocol. To support better performance, the coherence protocol is usually distributed among the nodes. With current solutions, a coherence controller is associated with each memory unit and manages the coherence of data mapped to that memory unit. Each line of memory (typically a portion of memory tens of bytes in size) is assigned a home node, which manages the sharing of that memory line and guarantees its coherence.
The home node maintains a directory, which identifies the nodes that possess a copy of the memory line. When a node requires a copy of the memory line, it requests the memory line from the home node. The home node supplies the data from its memory if its memory has the latest data. If another node has the latest copy of the data, the home node directs this node to forward the data to the requesting node. The home node employs a coherence protocol to ensure that when a node writes a new value to the memory line, all other nodes see this latest value. Coherence controllers implement this coherence functionality.
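By way of illustration only, the directory behavior described above may be sketched as follows. The sketch is a simplified model and not a definitive implementation: the names `DirectoryEntry`, `HomeNode`, `handle_read` and `handle_write` are hypothetical, a write-invalidate policy is assumed, and the requesting node is assumed not to hold the memory line already.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DirectoryEntry:
    """Hypothetical per-line directory state kept at the home node."""
    sharers: Set[int] = field(default_factory=set)  # nodes holding a copy
    owner: Optional[int] = None  # node holding the latest (modified) copy


class HomeNode:
    """Simplified home node: tracks which nodes possess each memory line."""

    def __init__(self) -> None:
        self.directory: Dict[int, DirectoryEntry] = {}

    def handle_read(self, line: int, requester: int) -> str:
        """Return who supplies the data: 'memory' or 'node <id>'."""
        entry = self.directory.setdefault(line, DirectoryEntry())
        if entry.owner is not None:
            # Another node has the latest copy: direct it to forward the data.
            supplier = f"node {entry.owner}"
            entry.sharers.add(entry.owner)  # assumed: old owner keeps a copy
            entry.owner = None
        else:
            # Home memory has the latest data.
            supplier = "memory"
        entry.sharers.add(requester)
        return supplier

    def handle_write(self, line: int, requester: int) -> Set[int]:
        """Make the requester the owner; return the nodes to invalidate."""
        entry = self.directory.setdefault(line, DirectoryEntry())
        holders = set(entry.sharers)
        if entry.owner is not None:
            holders.add(entry.owner)
        invalidated = holders - {requester}
        entry.sharers = set()
        entry.owner = requester
        return invalidated
```

In this sketch, a read by node 1 is served from memory, a subsequent write by node 2 invalidates the copy in node 1, and a later read by node 3 is served by forwarding from node 2.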
In typical multi-processor systems, exchanging messages on the network and looking up tables are fairly lengthy operations. Hence, substantial time may elapse between the time access to a data block is requested and the time the data block is received from another processor's cache. This latency is especially high when the requesting processor, the memory and coherence controller managing the data block, and the processor with the modified data are in three different nodes of the system since at least three inter-node messages are necessary. For example, this latency may be about 250 processor clock cycles. As processors continue to increase in their speed relative to the speed of the network and memory, this latency will progressively get higher. In many situations (such as when the processor wants to read the memory data block), the processor cannot perform any useful computation while it waits for the data block to arrive from the cache of the other processor. This leads to inefficient utilization of expensive processor resources and overall poor performance of the application.
The long latency in accessing modified data from another processor's cache, and its negative impact on application performance, is a well-known problem. Several solutions have been proposed to alleviate this problem. The mechanisms in the prior art all follow the approach of propagating data modifications to the copies in other processors' caches so that a processor can access the latest data in its cache itself.
In the typical cache-coherent multi-processor system, when a memory data block required (for reading or for writing) by a processor is not currently available in its cache, a message must be sent to the memory system requesting a copy of the data block. If the required memory data block is present in another processor's cache with a modified value, this new value must be provided to the requesting processor (this is called a cache-to-cache transfer). With typical coherence protocols, this is accomplished in the following way. When a processor A requires access to a data block, it sends a message to the memory and coherence controller managing the data block requesting a copy of the data block. The memory and coherence controller determines from a table that the data block is potentially in a modified state in another processor B's cache. The memory and coherence controller sends a message to processor B requesting that the data block be sent to processor A. Upon receiving the message, processor B sends the data block to processor A and also notifies the memory and coherence controller that it has done so.
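For illustration, the message sequence of the cache-to-cache transfer described above may be sketched as follows. The function names and message labels are hypothetical, and the sketch only counts messages that cross node boundaries before the data reaches the requester; it does not model actual latencies.

```python
from typing import List, Tuple

Message = Tuple[str, str, str]  # (source node, destination node, kind)


def cache_to_cache_messages(requester: str, home: str, owner: str) -> List[Message]:
    """Messages, in order, for access to a block that is modified at `owner`."""
    return [
        (requester, home, "request"),        # A asks the coherence controller
        (home, owner, "forward request"),    # controller asks B to send data
        (owner, requester, "data"),          # B sends the block to A
        (owner, home, "notification"),       # B tells the controller it did so
    ]


def critical_path_hops(messages: List[Message], requester: str) -> int:
    """Count inter-node messages up to and including delivery of the data."""
    hops = 0
    for src, dst, kind in messages:
        if src != dst:  # a message within one node does not use the network
            hops += 1
        if kind == "data" and dst == requester:
            return hops
    raise ValueError("no data message reaches the requester")
```

With the requester, the controller and the owner in three different nodes, `critical_path_hops(cache_to_cache_messages("A", "H", "B"), "A")` evaluates to 3. When the controller and the owner share a node, the same call with `owner="H"` evaluates to 2.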
In other past multi-processor systems, which use write-update coherence protocols, when a processor modifies a data block in its cache, the modified data block is immediately forwarded to all processors that have a copy of the data block in their cache. Since all copies of the data block are updated on every write, a processor accessing the data block in its cache will observe the latest value of the data block in its cache itself. The processor's access, hence, does not incur the latency of network messages and table lookup. Write-update protocols are not suitable, however, for several reasons. Firstly, commercial microprocessors do not support the write-update protocol (they support the write-invalidate protocol). Since the cache hierarchy in commercial processors is write-back, the caches do not propagate each write to the processor bus. Also, when a data block is to be modified, most processor bus protocols invalidate the data block in all other caches rather than updating them with the new value. Furthermore, while updates require that data be supplied to a cache that did not request it, processor bus protocols do not support any transaction that transfers data without an associated request on the bus. Secondly, write-update protocols are wasteful in bandwidth and can degrade performance. Updating all copies of a data block on each write to the data block can be wasteful because a processor receiving the updates may not use the data block at all. Also, updates of each individual write may be unnecessary in cases when a processor uses the data block only after a series of modifications to the data block have been completed. Updates also impose substantial bandwidth load on the buses, networks and processor caches. This bandwidth load can cause increased contention and queuing delays in the system, degrading performance.
Thirdly, since updates are sent only to processors that have a copy of the data block, write-update protocols do not provide any benefit when a processor's cache does not contain a copy of the data block.
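The bandwidth argument above may be illustrated with a deliberately simplified count of the messages a writer causes when it performs a series of writes to one data block. The function names are hypothetical, and the model is an assumption made only for illustration: it ignores acknowledgements, and for write-invalidate it ignores the later miss taken by any processor that reads the block again.

```python
def write_update_messages(num_writes: int, num_other_copies: int) -> int:
    """Write-update: every write is sent to every other cached copy."""
    return num_writes * num_other_copies


def write_invalidate_messages(num_writes: int, num_other_copies: int) -> int:
    """Write-invalidate: the first write invalidates the other copies;
    later writes in the series hit the writer's own cache."""
    return num_other_copies if num_writes > 0 else 0
```

Under these assumptions, ten writes to a block cached by four other processors cause 40 update messages but only 4 invalidations. With no other cached copies, write-update sends nothing, so a processor without a copy still takes a miss on its first access.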
Other past multiprocessor systems use what is known as the competitive-update mechanism, which is a hybrid between write-invalidate protocols and write-update protocols. As with write-update protocols, when a data block is modified all copies of the data block are updated. However, when a processor receiving the updates has not accessed its copy of the data block for several updates (a predetermined "competitive threshold"), its copy of the data block is invalidated. Subsequent updates to the data block will not be sent to this processor. When updates are unnecessary, this approach minimizes update bandwidth over the pure write-update protocol. However, the competitive-update approach retains the other disadvantages: it wastes network bandwidth when the updates are not used (e.g., in migratory sharing), it mandates support for write-update protocols in the processors and processor bus protocols, and it does not provide any benefit when a processor's cache does not contain a copy of the data block.
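A minimal sketch of the competitive-update rule described above is given below for illustration. The class `CompetitiveCopy` and its methods are hypothetical, and the sketch assumes that the copy is invalidated once the number of consecutive unused updates reaches the threshold; actual systems may define the threshold differently.

```python
class CompetitiveCopy:
    """One cached copy of a data block under a competitive-update policy."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # assumed: consecutive unused updates allowed
        self.unused_updates = 0     # updates received since the last local access
        self.valid = True

    def local_access(self) -> bool:
        """The processor accesses its copy; returns True on a cache hit."""
        if self.valid:
            self.unused_updates = 0
        return self.valid

    def receive_update(self) -> bool:
        """Apply one update; returns True while the copy stays valid."""
        if not self.valid:
            return False  # invalidated copies receive no further updates
        self.unused_updates += 1
        if self.unused_updates >= self.threshold:
            self.valid = False  # copy was not used: invalidate it
        return self.valid
```

With a threshold of 3, a copy that is accessed after two updates remains valid and its counter is reset, whereas three consecutive updates without an intervening access invalidate the copy, after which a local access misses.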
Still other past multi-processor systems introduced special processor instruction "primitives" that allow a processor to send a data block (or multiple data blocks) to the cache of another processor. When an application (or program) requires that a data block written by one processor be accessed by another processor, the application's code includes these primitives (at appropriate points in the code) to send the data block from the producer processor's cache to the consumer processor's cache. If the send is completed before the consumer processor accesses the data block, the access can be completed in its cache itself without additional latency. There are several disadvantages with this approach. First, it changes the programming model (e.g., the mechanism used to communicate between processors has been changed) provided to the applications. Existing applications must be re-written or recompiled to obtain any benefit. Second, it requires that the application programmer or the compiler be able to identify the instances when a data block written by one processor would be accessed by another (specific) processor. Third, the approach requires extensions to the processor instruction set and implementation and also requires support for updates in the processor cache design and in the processor bus protocol.
As a result, there has been a long-sought need for a speculative recall and forwarding system that would decrease overall data transfer time, or latency, between processor caches. A simple-to-implement system, one that could be implemented without requiring any change to the processor architecture, compilers or programming model, has long eluded those skilled in this art.
The present invention provides a system that supports better processor utilization and better application performance by reducing the latency in accessing data by performing proactive speculative data transfers. In being proactive, the system speculates, without specific requests from the processors, as to what data transfers will reduce the latency and will make the data transfers according to information derived from the system at any time that data transfers could be made.
The present invention provides a system that supports better processor utilization and better application performance by reducing the latency in accessing data by performing proactive speculative data forwarding. In being proactive, the system speculates, without specific requests from the processors, as to what data transfers will reduce the latency and will forward the data to a processor likely to need it according to information derived from the system at any time that data transfers could be made.
The present invention provides a system that supports better processor utilization and better application performance by reducing the latency in accessing data by performing proactive speculative data recall. In being proactive, the system speculates, without specific requests from the processors, as to what data transfers will reduce the latency and will recall the modified data from caches according to information derived from the system at any time that data transfers could be made.
The present invention provides a system that supports better processor utilization and better application performance by reducing the latency in accessing data by performing proactive speculative data transfers. In being proactive, the system speculates, without specific requests from the processors, as to what data transfers will reduce the latency and will make the data transfers according to historical information derived from the system at any time that data transfers could be made.
The present invention is simple to implement and can be implemented without requiring any change to the processor architecture, compilers or programming model.