1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to address translations employed within multiprocessor computer systems having distributed shared memory architectures.
2. Description of the Relevant Art
Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.
A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).
Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or "snooped") against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
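By way of illustration only, the invalidation alternative described above may be sketched as follows. The sketch assumes a simplified write-through scheme in which every write updates main memory and invalidates the copies held by the other caches. The names Cache, SharedBus and snoop_invalidate are hypothetical and do not appear in the original text, and no particular coherency protocol is implied.

```python
class Cache:
    """A toy cache holding address -> value copies (hypothetical model)."""

    def __init__(self):
        self.lines = {}

    def read(self, addr, memory):
        # On a miss, transfer the current value from main memory.
        if addr not in self.lines:
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def snoop_invalidate(self, addr):
        # Drop the copy, if any, so that the next read refetches it.
        self.lines.pop(addr, None)


class SharedBus:
    """Every write is visible to, and snooped by, all attached caches."""

    def __init__(self, memory, caches):
        self.memory = memory
        self.caches = caches

    def write(self, writer, addr, value):
        # Assumption: write-through, so main memory is always current.
        self.memory[addr] = value
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)
        writer.lines[addr] = value


memory = {0x100: 1}
c0, c1 = Cache(), Cache()
bus = SharedBus(memory, [c0, c1])

c1.read(0x100, memory)                 # c1 now holds a copy of the old value
bus.write(c0, 0x100, 2)                # c0 updates the address
stale_copy_present = 0x100 in c1.lines
value_after_refetch = c1.read(0x100, memory)
```

In this sketch the copy held by c1 is removed by the snooped write, so the subsequent read by c1 misses and obtains the updated value from main memory.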
Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g. a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed the available bus bandwidth.
Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may decrease as more processors are added.
These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.
Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.
Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.
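By way of illustration only, a directory of the kind described above may be sketched as a mapping from an address to the set of nodes holding cached copies. The class name Directory and its methods are hypothetical, and the sketch assumes a simple policy in which a write leaves the writer as the sole holder of the data.

```python
from collections import defaultdict


class Directory:
    """Tracks which nodes hold cached copies of each address (hypothetical)."""

    def __init__(self):
        self.sharers = defaultdict(set)  # address -> set of node ids

    def record_copy(self, addr, node_id):
        self.sharers[addr].add(node_id)

    def record_write(self, addr, writer_node):
        # Examine the directory to decide which nodes need coherency
        # activity: every holder of a copy other than the writer.
        targets = sorted(self.sharers[addr] - {writer_node})
        # Assumption: after the write, only the writer holds the data.
        self.sharers[addr] = {writer_node}
        return targets


directory = Directory()
for node in (0, 2, 3):
    directory.record_copy(0x2000, node)

invalidation_targets = directory.record_write(0x2000, writer_node=2)
```

Here a write by node 2 yields coherency activity directed only at nodes 0 and 3, rather than a broadcast to every node on the network.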
Distributed shared memory systems may employ local and global address spaces. A portion of the global address space is assigned to each node within the distributed shared memory system. Accesses to the address space assigned to a requesting node (i.e. local address space) are typically local transactions. Accesses to portions of the address space not assigned to the requesting node are typically global transactions.
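By way of illustration only, the distinction between local and global transactions may be sketched as follows. The sketch assumes that the global address space is divided into equal, contiguous portions, one per node. The portion size NODE_SPAN and the function names are hypothetical and are not taken from the original text.

```python
# Assumption: each node is assigned one contiguous, equally sized portion
# of the global address space. The size below is purely illustrative.
NODE_SPAN = 0x1000_0000


def home_node(addr):
    """Return the node to which the address is assigned."""
    return addr // NODE_SPAN


def classify(addr, requesting_node):
    """Classify an access as a local or a global transaction."""
    return "local" if home_node(addr) == requesting_node else "global"


kind_a = classify(0x0000_1000, requesting_node=0)  # within node 0's portion
kind_b = classify(0x1000_0040, requesting_node=0)  # within node 1's portion
kind_c = classify(0x1000_0040, requesting_node=1)  # same address, from node 1
```

The same address is thus a local transaction for the node to which it is assigned and a global transaction for every other node.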
In some distributed shared memory systems, data corresponding to addresses of remote nodes may be copied to a requesting node's shared memory such that future accesses to that data may be performed via local transactions rather than global transactions. In such systems, CPUs local to the node may use the local physical address assigned to the copied data. The copied data is referred to as a shadow page. Address translation tables are provided to translate between the global address and the local physical address assigned to the shadow copy.
During coherency operations, such as a request to obtain sufficient access rights to perform a transaction, the local physical address is translated to a global address. If the local physical address does not correspond to a shadow copy, the global address is the same as the local physical address (i.e., no translation is required).
Unfortunately, a local node typically cannot distinguish between an access to a shadow page, which requires an address translation, and an access to a local address, which does not. Accordingly, the local node typically performs an address translation on all local physical addresses during coherency operations. These address translations add unnecessary latency to local memory accesses and increase the bandwidth requirement of the address translation table.
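By way of illustration only, the unconditional translation described above may be sketched as follows. The sketch assumes a page-granular translation table keyed by local page number. The page size, the class name TranslationTable and the lookups counter are hypothetical and serve only to make the cost of each table access visible.

```python
PAGE_SIZE = 0x2000  # assumption: illustrative page size


class TranslationTable:
    """Local physical address to global address table (hypothetical)."""

    def __init__(self):
        self.local_to_global_page = {}
        self.lookups = 0  # counts table accesses

    def map_shadow_page(self, local_page, global_page):
        self.local_to_global_page[local_page] = global_page

    def to_global(self, lpa):
        # The node cannot tell a shadow page from local data by the
        # address alone, so every address consults the table.
        self.lookups += 1
        page, offset = divmod(lpa, PAGE_SIZE)
        # No entry means the data is local: the global address is the
        # same as the local physical address.
        global_page = self.local_to_global_page.get(page, page)
        return global_page * PAGE_SIZE + offset


table = TranslationTable()
table.map_shadow_page(local_page=3, global_page=0x8005)

shadow_ga = table.to_global(3 * PAGE_SIZE + 0x10)  # shadow page access
local_ga = table.to_global(0x40)                   # ordinary local access
```

Both accesses consult the table, although only the first requires a translation. The second lookup is the unnecessary latency and table bandwidth referred to above.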
A multiprocessor computer system that eliminates unnecessary address translations is thus desirable.
The problems outlined above are in large part solved by a multiprocessor computer system in which the local physical memory of a node includes two address spaces. Both a local address space and a coherent memory replication (CMR) address space are mapped to the local physical memory of a node. When a shadow copy is stored in a node, the data is assigned an address within the CMR space. Local data is assigned addresses within the local address space. When coherency operations occur, the address translation circuitry can determine whether the accessed data is local data or a shadow copy based upon the address. Accordingly, the address translation circuitry can perform a local physical address to global address translation for shadow copies. For addresses within the local address space, an address translation is not performed, which reduces the latency of the local data access and the bandwidth requirement of the address translation circuit.
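By way of illustration only, selecting whether to translate based upon the address may be sketched as follows. The sketch assumes that the CMR address space occupies a single contiguous range of local physical addresses. The bounds CMR_BASE and CMR_LIMIT, the page size and the class name SelectiveTranslator are hypothetical, and the original text describes translation circuitry rather than software.

```python
PAGE_SIZE = 0x2000       # assumption: illustrative page size
CMR_BASE = 0x4000_0000   # assumption: illustrative start of the CMR space
CMR_LIMIT = 0x8000_0000  # assumption: illustrative end (exclusive)


def is_cmr(lpa):
    """True if the local physical address lies within the CMR space."""
    return CMR_BASE <= lpa < CMR_LIMIT


class SelectiveTranslator:
    """Translates only addresses within the CMR space (hypothetical)."""

    def __init__(self):
        self.cmr_to_global_page = {}
        self.lookups = 0  # counts table accesses

    def map_shadow_page(self, cmr_page, global_page):
        self.cmr_to_global_page[cmr_page] = global_page

    def to_global(self, lpa):
        if not is_cmr(lpa):
            # Local address space: no translation and no table access.
            return lpa
        self.lookups += 1
        page, offset = divmod(lpa, PAGE_SIZE)
        # An unmapped CMR page raises KeyError in this sketch.
        return self.cmr_to_global_page[page] * PAGE_SIZE + offset


translator = SelectiveTranslator()
shadow_lpa = CMR_BASE + 5 * PAGE_SIZE + 0x18
translator.map_shadow_page(shadow_lpa // PAGE_SIZE, global_page=0x9001)

shadow_ga = translator.to_global(shadow_lpa)  # translated via the table
local_ga = translator.to_global(0x1234)       # returned unchanged
```

In this sketch a range comparison decides whether the table is consulted at all, so the access to local data incurs no table lookup.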
Broadly speaking, the present invention contemplates a multiprocessor computer system comprising a first node, a second node and a global bus. The first node includes a first processor, a first cache coupled to the first processor, a first local bus coupled to the first cache, a first local memory coupled to the first local bus and a first system interface coupled to the first local bus. A first address space and a second address space are mapped to the first local memory and the first address space is configured to store data local to the first node. The first system interface includes a first directory configured to store coherency data for data local to the first node. The second node includes a second processor, a second cache coupled to the second processor, a second local bus coupled to the second cache, a second local memory coupled to the second local bus and a second system interface coupled to the second local bus. The second system interface includes a second directory configured to store coherency data for data local to the second node. The global bus is coupled to the first system interface and the second system interface. The first address space is configured to store data local to the first node and the second address space is configured to store copies of data local to the second node. The data stored in the second address space is assigned a physical address local to the first node and the first system interface converts the physical address local to the first node to a global address prior to performing a request on the global bus.
The present invention further contemplates a method of performing selective address translation in a multiprocessing computer system comprising: mapping a first address space and a second address space to a local memory of a first node of the multiprocessing computer system; storing data local to the first node in the first address space; storing copies of data local to a second node of the multiprocessing computer system in the second address space, wherein the copies of data stored in the second address space are assigned local addresses of the first node; and converting the local addresses of the data stored in the second address space to global addresses prior to performing a global operation.