1. Field of the Invention
The present invention generally relates to computers. More particularly, the present invention relates to computers that may have more than a single node, and where each node has more than a single processor.
2. Description of the Related Art
Early computer systems comprised a single processor, along with the processor's associated memory, input/output devices, and mass storage systems such as disk drives, optical storage, magnetic tape drives, and the like.
As demand for processing power increased beyond what was possible to build as a single processor computer, multiple processors were coupled together by one or more signal busses. A signal bus comprises one or more electrically conducting elements. For example, a signal bus might simultaneously carry 64 bits of data from a first processor to a second processor. Many signal busses are logically subdivided and have an address bus portion, a control bus portion, and a data bus portion. Typically, signal busses in large computer systems further comprise parity or error correcting code (ECC) conductors to detect and/or correct errors that may occur during signal transmission.
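By way of illustration only, the error-detecting role of a parity conductor can be sketched as follows. The sketch assumes a single even-parity bit over a 64-bit data word; the function names and the sample data value are hypothetical and are not taken from any particular bus design. A single parity bit detects any odd number of flipped bits but cannot correct them, which is why ECC is used where correction is required.

```python
def even_parity_bit(word: int, width: int = 64) -> int:
    """Return the bit that makes the total count of 1 bits (data + parity) even."""
    mask = (1 << width) - 1
    return bin(word & mask).count("1") % 2


def transfer_is_consistent(word: int, parity: int, width: int = 64) -> bool:
    """Receiver-side check: recompute parity and compare with the received bit."""
    return even_parity_bit(word, width) == parity


# Hypothetical 64-bit data word driven onto the data bus portion.
sent = 0x0123456789ABCDEF
parity = even_parity_bit(sent)

# One conductor flips in transit: detected.
one_bit_error = sent ^ (1 << 17)

# Two conductors flip in transit: parity alone does not detect this.
two_bit_error = sent ^ 0b11
```

In this sketch, `transfer_is_consistent(one_bit_error, parity)` is false, while `transfer_is_consistent(two_bit_error, parity)` is true even though the data is corrupted.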
Further demand for processing power forced computer designers to create computer systems having more than one node, where a node typically comprised more than one processor, each processor having several levels of cache dedicated to that processor. Each node has a relatively large amount of memory that has coherency managed by hardware. Computer systems having the capability to expand from a single node to many nodes are also advantageous in that a customer can start with a small (perhaps single-node) system, and purchase more nodes as the customer's need for processing power grows. Such computer systems are scalable, in that the power of the computer systems scales with the customer's need for processing power.
Such a computer system typically features multiple processors on each node. If all processors on a node share a common processor bus, each processor in the node can “snoop” the processor bus for address references of load and store commands issued by other processors in the node to ensure memory coherency. Each processor in the node can then check and update the state of cache lines owned by each processor. For example, within a particular node, if a first processor makes a reference to a cache line currently in an L3 cache of a second processor, the second processor will recognize the reference and will send the cache line over the processor bus to the first processor, without the need for passing the cache line into and subsequently from a memory controller also coupled to the processor bus. Snoop cache techniques are well-known in the computer industry.
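By way of illustration only, the cache-to-cache transfer described above can be sketched as follows. The sketch is a deliberate simplification: it assumes a single shared processor bus, models each cache as a simple address-to-data mapping, and omits cache line states and store handling entirely. The class names `Processor` and `SharedBus` and all addresses are hypothetical.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # cache line address -> data (line states omitted)

    def snoop(self, address):
        """Observe a reference on the shared bus; return the line if held."""
        return self.cache.get(address)


class SharedBus:
    def __init__(self, processors, memory):
        self.processors = processors
        self.memory = memory  # stands in for the memory controller's storage

    def load(self, requester, address):
        """Broadcast a load; every other processor snoops the address."""
        for p in self.processors:
            if p is not requester:
                data = p.snoop(address)
                if data is not None:
                    # Line supplied directly over the bus by the holder,
                    # without passing through the memory controller.
                    requester.cache[address] = data
                    return data, p.name
        data = self.memory[address]
        requester.cache[address] = data
        return data, "memory"


p0, p1 = Processor("p0"), Processor("p1")
bus = SharedBus([p0, p1], memory={0x100: "old line", 0x200: "line B"})
p1.cache[0x100] = "line A"  # p1 holds a newer copy than memory

from_peer = bus.load(p0, 0x100)
from_memory = bus.load(p0, 0x200)
```

Here the first load is satisfied by the second processor's cache, and the second load, which no other processor holds, falls through to memory.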
A problem exists in transmitting a high volume of requests and data over a processor bus shared by multiple processors. Bandwidth of data coming to or from an L4 memory, as well as requests for loads or stores, is shared by the processors sharing the processor bus and this sharing of bandwidth limits processing throughput of the node and therefore the computer system. The problem is further aggravated by the required electrical topology of the processor bus. For fastest data transmission, a very simple electrical configuration of a bus is implemented, ideally “point-to-point”, in which the bus couples only two units; for example, a single processor to a memory controller. As more couplings are added to the processor bus, the processor bus gets physically longer, and discontinuities of the physical connections introduce reflections on the processor bus, forcing a longer time period for each transmission of data. Therefore, the number of processors coupled to a processor bus is a performance limiter.
One technique used to limit the number of processors sharing a processor bus provides a separate processor bus for each processor (or, perhaps two processors, if bus topology allows acceptable performance). The following discussion, for simplicity, assumes two processors in a node, each processor having a processor bus coupled to itself and further coupled to a memory controller. While this technique provides two busses and simplifies the electrical topology of the interconnect, a first processor on a first processor bus can no longer directly “snoop” the load and store requests of a second processor coupled to the other processor bus in the node. The memory controller could drive each load and store request seen on the first processor bus onto the second processor bus, and drive each load and store request seen on the second processor bus onto the first processor bus, but such a technique would be extremely wasteful and negate most of the advantages expected from providing a separate bus to each processor. To eliminate the need to drive each processor's load and store requests to the other processor, a snoop directory is typically designed as a fixed portion of a directory memory inside of, or coupled to, the memory controller. The snoop directory contains directory entries about cache lines used by any processor in the node. The memory controller uses the snoop directory to filter load and store requests from each processor so that only those load and store requests that the other processor must be aware of, or respond to, are forwarded to the other processor.
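By way of illustration only, the filtering role of the snoop directory can be sketched as follows. The sketch assumes the directory records, for each cache line address, the set of processors in the node holding that line; the entry format, the class name `MemoryController`, and the method `handle_request` are hypothetical, and replacement of directory entries when the directory fills is not modeled.

```python
class MemoryController:
    def __init__(self):
        # Snoop directory: cache line address -> ids of processors holding it.
        self.snoop_directory = {}
        # Record of (target processor, address) requests actually forwarded.
        self.forwarded = []

    def handle_request(self, requester, address):
        """Forward a request only to processors the directory says hold the line."""
        holders = self.snoop_directory.setdefault(address, set())
        targets = sorted(holders - {requester})
        for target in targets:
            self.forwarded.append((target, address))
        holders.add(requester)  # requester now holds the line
        return targets


mc = MemoryController()
first = mc.handle_request(requester=0, address=0x40)   # nobody else holds 0x40
second = mc.handle_request(requester=1, address=0x40)  # processor 0 holds 0x40
third = mc.handle_request(requester=1, address=0x80)   # nobody else holds 0x80
```

Of the three requests, only the second is driven onto the other processor's bus; the other two are filtered by the directory lookup.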
Each node must also retain directory entries for cache lines that have been sent to other nodes in the computer system. This information is stored in a remote memory directory in a portion of the directory memory that is not allocated to the snoop directory. In early computer systems, the allocation of directory memory is fixed, regardless of the number of nodes in the computer system. When a computer system is configured having only one node, no remote memory directory is in fact required, causing the memory allocated to the remote memory directory to be wasted. When a large number of nodes are installed in the computer system, the fixed partition allocated for the remote memory directory may be smaller than optimal. Application Ser. No. 10/403,157 is directed to providing a more optimal allocation of directory memory by using the number of nodes in the system as a determinant of the directory memory allocation.
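By way of illustration only, the cost of a fixed partition in a single-node configuration can be sketched as follows. The entry counts are hypothetical values chosen for the example, as are the class name `FixedPartition` and its members; the sketch does not represent the allocation method of any referenced application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedPartition:
    total_entries: int   # entries in the whole directory memory
    remote_entries: int  # entries reserved for the remote memory directory

    @property
    def snoop_entries(self) -> int:
        """Entries left for the snoop directory."""
        return self.total_entries - self.remote_entries

    def unused_entries(self, num_nodes: int) -> int:
        """Entries that can never be used in the given configuration.

        With one node, no cache line is ever sent to another node, so the
        whole remote memory directory portion is wasted.
        """
        return self.remote_entries if num_nodes == 1 else 0


# Hypothetical sizes: 4096 directory entries, 1024 fixed for remote use.
partition = FixedPartition(total_entries=4096, remote_entries=1024)
wasted_single_node = partition.unused_entries(num_nodes=1)
wasted_four_nodes = partition.unused_entries(num_nodes=4)
```

Under these assumed sizes, a single-node system leaves a quarter of the directory memory idle, while the snoop directory remains limited to the other three quarters.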
However, depending on workload, an optimal allocation of the directory memory between the snoop directory and the remote memory directory can change. For example, during the day, the computer system might be running transaction data processing on a very large database, in which a relatively large amount of data in the address space of one node is needed by another node, and, therefore, a relatively larger proportion of the directory memory should be allocated to the remote memory directory. At night on the same computer system, perhaps a scientific, numerically intensive application is run, in which most, if not all, memory requirements of a node are satisfied by memory residing on the node. In such a case, an optimal allocation of the directory memory is to allocate very little space to the remote memory directory.
Improvements in software (such as upgrades) might also be a determinant of the optimal allocation of the directory memory. For example, a compiler might generate code that does not make good use of locality of reference. The compiler might scatter data over a wide addressing range, in which case, more data would be needed from the memory addressing space of other nodes in the computer system, and more of the directory memory is optimally allocated to the remote memory directory. If an improved version of the compiler is installed, improvements in use of locality of reference may occur, reducing the number of memory references to other nodes, and therefore reducing the optimal size of the remote memory directory.
In a scalable computer system, optimal allocation of the directory memory also depends on the number of nodes installed. For example, if only one node is installed, the remote memory directory is not needed at all. If a customer adds a second node, the remote memory directory is required, but need not (in most cases) be large. As a third, fourth, fifth (and so on) node is added, more and more data, typically, must be shared between nodes, and therefore, more remote directory entries are needed. The optimal allocation of the directory memory tends to provide for a relatively larger remote memory directory as more nodes are added. Today's computer customer wants processing power on demand. Nodes already physically in the computer system can be enabled dynamically during peak periods and disabled when not required. Directory memory allocation should respond to such enabling/disabling events.
In the emerging world of autonomic computing, the allocation of directory entries between uses would ideally be dynamic to optimize performance for changing workloads with changing locality of reference. For example, in a server application, the virtual machine system management may autonomically detect the need for more or fewer processors for a given partition running on the virtual machine. The system management would automatically add processors to, or remove processors from, that partition across nodes. When a partition crosses a node boundary, its memory locality of reference will change. In this situation, the allocation of directory memory between the snoop directory and the remote memory directory should be adjusted for optimal performance.
Therefore, a need exists to provide a dynamic node directory management system for a computer system having more than one processor per node, the computer system being scalable in the number of nodes installed or used.