1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to block copy operations in multiprocessor computer systems.
2. Description of the Relevant Art
Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.
A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).
Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or “snooped”) against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
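The invalidation alternative described above can be illustrated with a minimal sketch. It assumes a write-through memory and a single shared bus; the class and method names (SnoopingCache, SharedBus, snoop_write) are hypothetical and are not drawn from any particular system.

```python
class SnoopingCache:
    """Toy cache holding address -> value for valid lines only (hypothetical)."""

    def __init__(self):
        self.lines = {}

    def snoop_write(self, address):
        # Invalidate on a snooped write: drop any stale copy of the address.
        self.lines.pop(address, None)


class SharedBus:
    """Toy shared bus; every write is visible to every attached cache."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, address):
        # A miss transfers the current value from main memory.
        if address not in cache.lines:
            cache.lines[address] = self.memory[address]
        return cache.lines[address]

    def write(self, writer, address, value):
        # Write-through is assumed here purely for simplicity.
        self.memory[address] = value
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)
        writer.lines[address] = value
```

In this sketch, a cache whose copy was invalidated misses on its next read and therefore obtains the updated value from main memory.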
Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g. a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed available bus bandwidth.
Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may decrease as more processors are added.
These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.
Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.
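As a rough illustration of how a directory may drive coherency activity, the sketch below tracks which nodes hold cached copies of each address. The Directory class and its methods are hypothetical, and the invalidate-on-write policy is an assumption made for the example.

```python
from collections import defaultdict


class Directory:
    """Toy directory: address -> set of node ids holding a cached copy."""

    def __init__(self):
        self.sharers = defaultdict(set)

    def record_read(self, address, node):
        # The reading node now holds a cached copy of the address.
        self.sharers[address].add(node)

    def record_write(self, address, node):
        # Every other sharer must be invalidated; the writer keeps the only copy.
        to_invalidate = self.sharers[address] - {node}
        self.sharers[address] = {node}
        return to_invalidate
```

Examining the directory on a write yields exactly the set of nodes that need a coherency message, so no broadcast to all nodes is required.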
Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.
Unfortunately, processor access to memory stored in a remote node (i.e. a node other than the node containing the processor) is significantly slower than access to memory within the node. In particular, block copy operations may suffer from severe performance degradation in a distributed shared memory system. Typically, block copy operations involve reading data from a source block and storing data to a destination block. The block is defined by the operating system employed by the computer system, and is typically several kilobytes in size. The processor performs the copy by reading the data from the source block and writing the data to the destination block. Certain advanced processors employ special instructions (read and write stream) which read and write cache lines of data without polluting the caches.
If the processor performing the block copy operation resides in the node having the destination block but not the source block, each read from the source block requires a remote node access. Remote node accesses are typically slow, and the corresponding write does not occur until the data has been provided. The processor is therefore occupied with the block copy operation for a considerable length of time, most of which it may spend awaiting data transfer from the remote node. Unfortunately, the processor is stalled during this time period, and little, if any, useful work is performed by the processor.
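A back-of-the-envelope model may make the cost of the stall concrete. The block size, coherency unit size, and latencies below are hypothetical values chosen only for illustration; the text above gives no figures.

```python
def processor_driven_copy_ns(units, remote_read_ns, local_write_ns):
    """Total copy time when each write must wait for its remote read."""
    return units * (remote_read_ns + local_write_ns)


def stall_fraction(remote_read_ns, local_write_ns):
    """Fraction of the copy time spent waiting on the remote node."""
    return remote_read_ns / (remote_read_ns + local_write_ns)


# Hypothetical figures: an 8 KB block, 64-byte coherency units,
# 1000 ns per remote read and 250 ns per local write.
UNITS = 8192 // 64
total_ns = processor_driven_copy_ns(UNITS, 1000, 250)  # 160000
waiting = stall_fraction(1000, 250)                    # 0.8
```

Under these assumed numbers the processor would spend four fifths of the copy waiting on the remote node.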
The performance of block copy operations is crucial to many operating systems. For example, the UNIX operating system depends upon an efficient block copy operation for high performance. It is therefore desirable to have an efficient block copy mechanism, even in a distributed shared memory architecture.
The problems outlined above are in large part solved by a computer system in accordance with the present invention. In order to perform a block copy from a remote source block to a local destination block, a processor within the local node of the computer system performs a specially coded write operation. This write operation signals to the system interface within the local node that a block copy operation is being requested; the data from the write operation is discarded. The system interface, upon detection of the specially coded write operation, performs a read operation to the source block in the remote node. Concurrently, the write transaction is allowed to complete in the local node such that the processor may proceed with subsequent computing tasks while the local node completes the copy operation. Advantageously, the read from the remote node and subsequent storage of the data in the local node is completed by the system interface in the local node, not by the processor. Since the processor may perform additional activities while the copy completes, performance of the computer system may be enhanced. In particular, the processor may begin a new block copy request. The new block copy request may then at least partially overlap with the first block copy request.
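The division of labor described above can be sketched as a simple queue model. This is only an illustration under stated assumptions: the SystemInterface class, its method names, and the per-unit source_of mapping (a stand-in for address translation) are all hypothetical, and real hardware would service requests concurrently rather than through an explicit drain call.

```python
from collections import deque


class SystemInterface:
    """Toy system interface that completes copies on the processor's behalf."""

    def __init__(self, remote_memory, local_memory, source_of):
        self.remote_memory = remote_memory
        self.local_memory = local_memory
        self.source_of = source_of  # hypothetical: destination -> source address
        self.pending = deque()

    def block_copy_write(self, destination, data=None):
        # The write data is discarded; only the request itself matters.
        # Returning immediately models the write completing locally.
        self.pending.append(destination)

    def service_one(self):
        # Read from the remote source and store into the local destination.
        destination = self.pending.popleft()
        source = self.source_of[destination]
        self.local_memory[destination] = self.remote_memory[source]

    def drain(self):
        while self.pending:
            self.service_one()
```

Because block_copy_write returns before any data moves, the processor in this model can issue a second request while the first is still outstanding, which is the overlap the paragraph describes.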
In one specific embodiment, the specially coded write operation is indicated using certain most significant bits of the address of the write operation. The address identifies the destination coherency unit within the local node, and a translation of the address to a global address identifies the source coherency unit. Subsequent to completion of the copy operation, the destination coherency unit may be accessed in the local node.
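One way to picture the address encoding is to reserve the top bits of the address as a tag. The bit positions and tag value below are assumptions made for the sketch, not values given in the text, and the function names are hypothetical.

```python
TAG_SHIFT = 41          # hypothetical: tag occupies the bits above bit 40
BLOCK_COPY_TAG = 0b11   # hypothetical tag value marking a block copy write


def encode_block_copy_address(local_address):
    """Place the block copy tag in the most significant bits."""
    if not 0 <= local_address < (1 << TAG_SHIFT):
        raise ValueError("address does not fit below the tag bits")
    return (BLOCK_COPY_TAG << TAG_SHIFT) | local_address


def decode_address(address):
    """Return (is_block_copy, destination_address_without_tag)."""
    tag = address >> TAG_SHIFT
    destination = address & ((1 << TAG_SHIFT) - 1)
    return tag == BLOCK_COPY_TAG, destination
```

A write to an untagged address decodes as an ordinary write, so the same destination coherency unit can be accessed normally once the copy has completed.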
Broadly speaking, the present invention contemplates a method for performing block copy operations from a remote processing node to a local processing node in a multiprocessor computer system. A block copy write to at least one coherency unit within a destination block is executed by a processor within the local processing node. The local processing node detects the block copy write. Upon detection, the local node generates a read request identifying a corresponding coherency unit within a source block located within the remote processing node. The generated read request is then transmitted to the remote processing node. Data from the corresponding coherency unit is received into the local processing node, and is stored into the coherency unit within the destination block.
The present invention further contemplates an apparatus for performing block copy operations comprising a processor and a system interface. The processor includes a memory management unit configured to translate a virtual address of a memory operation to a local physical address or global address. The local physical address resides in a specific predefined address space if a block copy operation is to be performed. Coupled to receive the block copy operation from the processor, the system interface is configured to perform a translation from the local physical address to a global address. Additionally, the system interface is configured to transmit a read request including the global address via a network on behalf of the block copy operation. The system interface includes a translation storage for storing information for performing the translation from the local physical address to the global address on a page by page basis.
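The page by page translation can be sketched as a table keyed by page number, with the offset within the page carried over unchanged. The page size and the TranslationStorage interface shown here are assumptions for illustration only.

```python
PAGE_SIZE = 8192  # hypothetical page size in bytes


class TranslationStorage:
    """Toy table mapping local physical page numbers to global page numbers."""

    def __init__(self):
        self.page_map = {}

    def map_page(self, local_page, global_page):
        self.page_map[local_page] = global_page

    def to_global(self, local_address):
        # Translate the page number; keep the offset within the page.
        page, offset = divmod(local_address, PAGE_SIZE)
        return self.page_map[page] * PAGE_SIZE + offset
```

Storing one entry per page, rather than one per coherency unit, keeps the table small while still letting every unit within a mapped page be located.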
The present invention still further contemplates a computer system comprising first, second, and third processing nodes. The first processing node includes a request agent configured to perform a read request for a coherency unit upon execution of a block copy write to the coherency unit by a processor within the first processing node. The second processing node includes a home agent, and is coupled to receive the read request from the first processing node. The second processing node is a home node for the coherency unit. Upon receipt of the read request, the home agent is configured to identify an owner of the coherency unit. The home agent is configured to transmit a demand. The third processing node is coupled to receive the demand via a slave agent included therein. The slave agent is configured to convey data corresponding to the coherency unit to the first processing node upon receipt of the demand.
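The three-node exchange above may be easier to follow as a message flow. The sketch below models each agent as a small class and each message as a direct method call; all class and method names are hypothetical, and a single global address is used as the key for simplicity, leaving out the local-to-global translation.

```python
class SlaveAgent:
    """Agent in the owning node; conveys data when it receives a demand."""

    def __init__(self, data):
        self.data = data  # global address -> contents of the coherency unit

    def handle_demand(self, address, requester):
        # Data goes directly to the requesting node, not back through home.
        requester.receive_data(address, self.data[address])


class HomeAgent:
    """Agent in the home node; identifies the owner and sends a demand."""

    def __init__(self, owners, slaves):
        self.owners = owners  # global address -> owning node id
        self.slaves = slaves  # node id -> SlaveAgent

    def handle_read_request(self, address, requester):
        owner = self.owners[address]
        self.slaves[owner].handle_demand(address, requester)


class RequestAgent:
    """Agent in the local node; issues the read on a block copy write."""

    def __init__(self, home):
        self.home = home
        self.local_memory = {}

    def on_block_copy_write(self, address):
        self.home.handle_read_request(address, self)

    def receive_data(self, address, data):
        self.local_memory[address] = data
```

In this model the home agent never handles the data itself; it only locates the owner and forwards the demand.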
The present invention additionally contemplates an apparatus configured to perform efficient block copy operations comprising a processor and a system interface. The processor is configured to initiate a block copy write to at least one coherency unit within a destination block. The destination block is located within a local processing node which includes the processor. The system interface is configured to detect the block copy write within the local processing node and to transmit a read request for a corresponding coherency unit within a source block located within a remote processing node. The system interface transmits the read request upon detection of the block copy write. Additionally, the system interface is further configured to receive data from the corresponding coherency unit of the source block and to store the data into the coherency unit within the destination block.
Moreover, the present invention contemplates a method for performing block copies. A block copy command is initiated via a processor. The block copy command identifies a first coherency unit within a source block and a second coherency unit within a destination block. Data corresponding to the first coherency unit is transmitted from a first processing node storing the source block to a second processing node storing the destination block. The data is then stored into the second coherency unit.
The present invention still further contemplates an apparatus for performing block copies comprising a processor and a system interface. The processor is configured to execute a block copy command identifying a first coherency unit within a source block and a second coherency unit within a destination block. Coupled to receive the block copy command, the system interface is configured to transfer data from the first coherency unit to the second coherency unit in response to the block copy command.