The idle time spent by computer processors while waiting for memory references to complete has become a much larger fraction of the total execution time for a wide variety of important commercial and technical computing workloads. Many prior-art techniques have been used in multiprocessor system designs to minimize the time a processor must wait while an access to a main storage location is completed. These techniques fall broadly into two categories. The first category attempts to find additional instructions for the processor to execute while it waits for the memory reference that is experiencing a delay. These techniques include hardware and software mechanisms such as out-of-order execution and multithreading. The second category focuses on minimizing the latency of the memory reference itself, for example through SRAM caches, DRAM caches and high-speed multiprocessor bus architectures. SRAM and DRAM caches have been extremely successful in reducing memory reference latency, and one or both are used by all current multiprocessor designs. Prior-art cache designs include specialized hardware and software which maintain cache coherence for multiprocessor systems. For systems which connect a plurality of processors via a shared bus, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or “snooped”) against data in the caches of all other devices attached to the bus. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
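The snooping behavior described above can be illustrated with a minimal sketch. The Python model below is illustrative only and is not taken from any prior-art design: the class names `Cache` and `SnoopBus`, the two-state (valid or invalid) line model, and the invalidate-on-write policy are all assumptions chosen for brevity.

```python
class Cache:
    """Hypothetical per-processor cache that tracks only which lines are valid."""

    def __init__(self, name):
        self.name = name
        self.valid_lines = set()

    def snoop(self, address, is_write):
        """Examine a bus transaction; return True if a copy of the line is held."""
        hit = address in self.valid_lines
        if hit and is_write:
            # Assumed policy: a write by another device invalidates the local copy.
            self.valid_lines.discard(address)
        return hit


class SnoopBus:
    """Hypothetical shared bus; every transaction is snooped by all other caches."""

    def __init__(self, caches):
        self.caches = caches
        self.snoops_performed = 0

    def transact(self, requester, address, is_write):
        """Broadcast one coherent transaction; return names of caches that held a copy."""
        holders = []
        for cache in self.caches:
            if cache is requester:
                continue
            self.snoops_performed += 1
            if cache.snoop(address, is_write):
                holders.append(cache.name)
        requester.valid_lines.add(address)
        return holders


caches = [Cache(f"cpu{i}") for i in range(4)]
bus = SnoopBus(caches)
first_read = bus.transact(caches[0], 0x100, is_write=False)
second_read = bus.transact(caches[1], 0x100, is_write=False)
write = bus.transact(caches[2], 0x100, is_write=True)
print(first_read, second_read, write, bus.snoops_performed)
```

In this sketch each of the three transactions is examined by the three other caches, so nine snoop lookups are performed even though only three of them find a copy of the line.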
Although caches have worked well for multiprocessor systems with a moderate number of processors, prior-art multiprocessor designs do not scale well when extended to large numbers of processors for many important workloads, including the transaction and database workload simulated by the TPC-C benchmark.
Logical partitioning, as described in U.S. Pat. No. 4,843,541, also scales poorly on prior-art system designs extended to large numbers of processors when shared processors are used. U.S. Pat. No. 4,843,541 shows how a virtual machine hypervisor program can be used to “partition the resources in a central electronic complex of a data processing system into a plurality of logical partitions”. Logical partitioning is widely used on large multiprocessor systems to run many workloads that operate on private data simultaneously. In a typical system employing logical partitioning, an operating system instance is initialized within each logical partition. The logical partition can have from 1 to n logical processors. The hypervisor is responsible for dispatching each of the logical processors onto a physical processor. If a physical processor is the host of just a single logical processor over a long period of time, it is said to be “dedicated” to that logical processor's partition. If a physical processor is the host of logical processors from multiple partitions, it is said to be a “shared” processor. It is desirable, from an overall hardware utilization point of view, for a large multiprocessor system to allow the flexibility of defining many or most of the physical processors as “shared” and to allow the movement of logical processors among the physical processors of the multiprocessor as the utilization of the physical processors fluctuates with external changes. Prior-art multiprocessor cache designs do not scale well for these partitioned workloads, especially when the physical processors are defined as “shared”.
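The distinction between “dedicated” and “shared” physical processors can be made concrete with a short sketch. The function name, the dispatch-log format, and the partition names below are hypothetical; the log stands in for the “long period of time” mentioned above, and the “unclassified” result covers a case the description does not define (several logical processors of one partition on the same physical processor).

```python
def classify_physical_processors(dispatch_log):
    """Classify each physical processor from a log of hypervisor dispatches.

    dispatch_log is an iterable of (physical_cpu, partition, logical_cpu)
    tuples, an assumed format used only for this illustration.
    """
    hosted = {}
    for physical_cpu, partition, logical_cpu in dispatch_log:
        hosted.setdefault(physical_cpu, set()).add((partition, logical_cpu))

    result = {}
    for physical_cpu, guests in hosted.items():
        partitions = {partition for partition, _ in guests}
        if len(guests) == 1:
            # Host of a single logical processor over the whole log.
            result[physical_cpu] = "dedicated"
        elif len(partitions) > 1:
            # Host of logical processors from more than one partition.
            result[physical_cpu] = "shared"
        else:
            # Not covered by the definitions in the text.
            result[physical_cpu] = "unclassified"
    return result


log = [
    (0, "LPAR1", 0), (0, "LPAR1", 0),
    (1, "LPAR1", 1), (1, "LPAR2", 0),
    (2, "LPAR2", 0), (2, "LPAR2", 1),
]
print(classify_physical_processors(log))
```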
A large factor in the poor performance scaling of large multiprocessors, for both the large single database workload and the shared logical partition case, is the relationship between increasing numbers of processors and the time delay required to communicate among them. Snoop bus protocols require memory references that miss local caches to be broadcast to all caches which may contain a copy of the requested lines, typically all other caches in the system. The bus bandwidth required to distribute the addresses and responses for large multiprocessor systems is very high. The need to provide this bandwidth has driven prior-art designs to use switch chips with many wide ports, expensive chip carriers to provide the needed pins, expensive card technology to provide good electrical characteristics and therefore high-speed buses, expensive card connectors to provide wide buses, and so on. The cost of all these elements has become a significant problem when trying to improve the cost/performance of large multiprocessor systems.
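A rough count shows why broadcast traffic grows so quickly with system size. The sketch below assumes, purely for illustration, that every cache generates the same number of misses and that each miss is snooped by every other cache in the system; the function name and the figure of 1000 misses per cache are hypothetical.

```python
def snoop_lookups(num_caches: int, misses_per_cache: int) -> int:
    """Total snoop lookups when every miss is broadcast to all other caches."""
    if num_caches < 1 or misses_per_cache < 0:
        raise ValueError("num_caches must be >= 1 and misses_per_cache >= 0")
    # Each of the num_caches caches issues misses_per_cache broadcasts,
    # and each broadcast is examined by the (num_caches - 1) other caches.
    return num_caches * misses_per_cache * (num_caches - 1)


for n in (4, 16, 64):
    print(n, snoop_lookups(n, 1000))
```

Under these assumptions the lookup count grows roughly with the square of the number of caches: 12,000 lookups for 4 caches, 240,000 for 16, and 4,032,000 for 64.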
Prior-art designs have attempted to solve these two problems, coherency operation latency and address bandwidth limitations, in many different ways, but each has imposed other costs on the system design which the current invention seeks to avoid.
Large shared caches, as exemplified in the IBM S/390 G4 design (IBM Journal of Research and Development Volume 41, Numbers 4&5, 1997), have been used in prior-art designs to address both problems. The interconnection of a few large shared caches does provide good latency for requests which hit in the shared cache. The inclusive shared cache also acts as a filter which, in some cases, eliminates the need to broadcast addresses to all of the processors in the system. However, the design does not scale well to large numbers of processors. Adding processors drives the design toward large multichip modules with many wiring layers and L2 cache chips with the extremely large number of I/Os required to provide a port for each connected processor.
Multiprocessor systems which rely on directories to track the access of local memory by remote requesters, as exemplified by the Sequent NUMA-Q design (“STiNG: A CC-NUMA Computer System for the Commercial Marketplace”, in Proc. 23rd International Symposium of Computer Architecture, May 1996), work to reduce the address bandwidth required for large numbers of processors. They do so at the expense of large RAM directories and an increase in protocol complexity and hardware support. This type of design also depends upon an assumption that the majority of the main storage lines referenced by a particular software process are located on the same physical node as the processor on which that process is currently dispatched. There are severe performance penalties for cases where a workload is accessing a large number of remote lines, since the number of lines that can be “checked out” by remote nodes is limited by the size of the NUMA directories. One goal of the current invention is to allow the movement of the execution of a workload quickly and easily among many processors without the need to move main storage contents and without significant performance degradation.
Hagersten et al., U.S. Pat. No. 5,852,716, describes the use of multiple address partitions in order to define cache coherent operations which are either “local” and confined to a subset of processors in a large multiprocessor or “global” and therefore broadcast to all processors. A local transaction in Hagersten is defined as one which has physical memory allocated to the same subset of processing nodes as the subset to which the processor which originates the storage request belongs. The description beginning at line 63 of column 7 of U.S. Pat. No. 5,852,716 makes it clear that this prior-art invention does not allow the movement of a process between what is referred to as “local domains” without either moving the physical storage associated with that process or changing the addressing mode to “global”.
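The consequence of this definition for process movement can be sketched as follows. The code is a simplified illustration and does not reproduce the address-partition mechanism of U.S. Pat. No. 5,852,716: the function name, the node-to-domain table, and the line-to-home-node table are all hypothetical.

```python
def classify_request(requester_node, line_address, domain_of_node, home_node_of_line):
    """Return "local" if the line's memory lies in the requester's domain, else "global"."""
    home_node = home_node_of_line[line_address]
    if domain_of_node[requester_node] == domain_of_node[home_node]:
        return "local"
    return "global"


# Hypothetical system: nodes 0 and 1 form domain "A", nodes 2 and 3 form domain "B".
domain_of_node = {0: "A", 1: "A", 2: "B", 3: "B"}
# The process's data line is allocated in memory on node 0.
home_node_of_line = {0x1000: 0}

# While the process runs on node 1, its references stay within domain "A".
print(classify_request(1, 0x1000, domain_of_node, home_node_of_line))
# After the process moves to node 2, the same reference must be treated as global
# unless the storage is first moved into domain "B".
print(classify_request(2, 0x1000, domain_of_node, home_node_of_line))
```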
We have determined that there is a need for techniques to reduce the transmission of address requests between the various processors in a multiprocessor computer system without using large amounts of SRAM directory and without requiring the movement of main storage contents. In developing solutions for fulfilling this need, we have determined that there is an associated need to reduce the latency of all storage reference transactions in large multiprocessor systems.