The present invention relates to a shared memory parallel processor system used in information processing apparatuses, especially personal computers (PCs), workstations (WSs), and server machines. In particular, the present invention relates to a control scheme of a shared memory between partitions.
In recent years, use of the shared memory multiprocessor architecture as a host module of parallel processors has spread. In this architecture, a configuration in which several tens to several hundred processors share a main memory is needed in some cases in order to improve performance. A typical configuration method of the shared memory multiprocessor is the bus connection symmetrical multiprocessor (SMP) used in personal computers. Since the bus throughput forms a bottleneck in the bus connection SMP, however, the number of processors which can be connected is limited to approximately four. Thus the bus connection SMP is not suitable for connecting a large number of processors.
In order to solve the above described problem, there has been proposed a method of connecting bus connection SMPs hierarchically by using a crossbar switch or the like. A typical example of the hierarchical SMP is found in "Gigaplane-XB: Extending the Ultra Enterprise Family", HOT Interconnects V, pp. 97 to 112, August 1997. The crossbar switch or the like between nodes logically functions as a bus. Coherence of a CPU cache between nodes of the bus connection SMP having processors and main memories can be managed at high speed by using a bus snoop protocol.
One of the problems of the large scale shared memory multiprocessor described above is reliability. In conventional shared memory multiprocessors, the whole system has one operating system (OS). Since all processors of the system can be managed by one OS, this scheme has the advantage that flexible system operation (such as load distribution) can be conducted. However, this scheme has the drawback that the system reliability falls in the case where a large number of processors are connected in a shared memory multiprocessor configuration. In a server of a cluster configuration in which a plurality of processors are connected by a network, respective nodes have different OSs. Even if a fatal error such as an OS bug occurs, only the corresponding node suffers from a system down state. On the other hand, in a shared memory multiprocessor in which the whole system is controlled by one OS, if a certain processor is brought into a down state by a system bug or the like, the OS is brought into a down state and consequently all processors are affected.
In order to solve this problem, there has been proposed a scheme in which the inside of a shared memory multiprocessor is divided into a plurality of partitions and a plurality of OSs are run independently. Each partition has an independent main memory. A processor of a certain partition basically accesses only the main memory of its own partition. As a result, it becomes possible to realize fault containment between partitions and improve the system performance.
Furthermore, in order to improve the operation performance and reduce the management cost by server consolidation, it is desired to integrate work which has been conducted by a plurality of servers into one highly multiplexed server. For this purpose as well, the above described partition technique is indispensable.
In the case where a shared memory multiprocessor is divided into partitions, how communication is conducted between partitions poses a problem. A scheme in which communication between partitions is conducted by making efficient use of the shared memory mechanism provided in the system before partitioning is advantageous in performance. Therefore, realization of a shared memory between partitions becomes necessary.
A partition technique of making a plurality of OSs run in one system has been used heretofore in mainframes, and it has been disclosed in U.S. Pat. No. 4,843,541. In this scheme, it is possible to make a plurality of guest OSs operate under the management of a host OS which manages the whole system. Respective guest OSs are independent systems having different address spaces. Access to a main memory in each partition is conducted according to the following procedure.
(1) A virtual address of a guest is translated to a real address.
(2) The above described guest real address is translated to a main memory address in the host.
(3) The main memory is accessed by using the main memory address in the host derived in (2).
The above described address translation of the two stages must be conducted between a CPU and the main memory.
In the partitions of the mainframe, conducting the above described address translation of the two stages makes it possible for respective guest partitions to have different address spaces and realizes fault containment. By mapping real addresses of different guests to overlapping main memory addresses in the address translation of (2), the shared memory can be realized.
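The three-step access procedure and the overlap-based sharing described above can be illustrated with a minimal sketch. This is not the mechanism of the cited patent; the page size, the table layout, and the names `Guest`, `translate`, `store`, and `load` are assumptions made only for illustration.

```python
PAGE = 4096  # assumed page size, chosen only for illustration


class Guest:
    """Hypothetical guest partition with two translation tables."""

    def __init__(self, page_table, host_map):
        self.page_table = page_table  # guest virtual page -> guest real page
        self.host_map = host_map      # guest real page -> host main memory page

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE)
        real_page = self.page_table[vpage]    # step (1): virtual -> guest real
        host_page = self.host_map[real_page]  # step (2): guest real -> host
        return host_page * PAGE + offset      # address used in step (3)


main_memory = {}  # host main memory, modeled as a sparse dictionary


def store(guest, vaddr, value):
    main_memory[guest.translate(vaddr)] = value  # step (3): write access


def load(guest, vaddr):
    return main_memory.get(guest.translate(vaddr))  # step (3): read access


# Private pages map to distinct host pages (100 and 101).
# Guest real pages 6 (guest A) and 3 (guest B) both map to host page 200,
# so that page behaves as memory shared between the two guests.
guest_a = Guest({0: 5, 1: 6}, {5: 100, 6: 200})
guest_b = Guest({0: 9, 1: 3}, {9: 101, 3: 200})

store(guest_a, PAGE + 8, "written by A")
shared_value = load(guest_b, PAGE + 8)  # guest B sees guest A's write
private_value = load(guest_b, 8)        # guest B's private page is untouched
```

In this sketch the guests never see host addresses, so each guest keeps its own address space, while the overlap in the second-stage table is the only point at which sharing is introduced.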
In realizing a partition mechanism and an inter-partition shared memory of a hierarchical bus connection SMP by using the above described conventional techniques, there are problems described hereafter.
The conventional inter-partition shared memory mechanism is premised on a concentrated main memory architecture having an address translation mechanism of two stages between each CPU and the main memory. Therefore, the conventional inter-partition shared memory mechanism is largely different in architecture from the hierarchical bus connection SMP. Accordingly, the conventional technique cannot be applied to the hierarchical bus connection SMP as it is. In particular, respective CPUs use standard components. As a result, the address translation of the two stages used in the conventional technique cannot be conducted in the CPU, and relocation of the address of each partition (guest) cannot be conducted.
Furthermore, in the hierarchical SMP, the CPU cache coherence is kept at a high speed by using the bus snoop protocol. Therefore, the inter-partition shared memory mechanism needs to be capable of supporting the bus snoop protocol.
Therefore, an object of the present invention is to realize a partition mechanism and an inter-partition shared memory mechanism suitable for the architecture of the hierarchical SMP.
In addition, a future parallel system must support a general purpose OS. Accordingly, the partition system needs to have a general purpose architecture which does not depend on a specific OS. It is necessary to make it possible for each partition to have a free address space. In addition, it is necessary to realize dynamic generation and deletion of partitions in order to deal with a large number of applications and to improve the reliability of the system by using dynamic reconfiguration of partitions.
Another object of the present invention is to flexibly manage the configuration of the inter-partition shared memory.
In addition, the partition system needs to realize high reliability at a low cost. Thus it is indispensable for partitions to back up each other. Therefore, a third object of the present invention is to facilitate recovery, conducted from another partition, in the case where the OS of a certain partition suffers from system down.
In order to achieve the above described first and second objects, the present invention is directed to a hierarchical SMP in which nodes, each having CPUs coupled by a bus and a main memory, are connected by a switch and cache coherence control is conducted through the switch. When the inside of the system is divided into partitions in each of which a different OS operates, the gateway from each node to the switch is provided with means for mutually translating the address of an access command for an area shared between partitions, between a real address used in a partition and an address used in common between partitions. As a result, the address of a local area of each partition can be set freely. In addition, cache coherence control of the shared area can be conducted at high speed by using a snoop command of the hierarchical SMP.
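The mutual translation conducted at the gateway can be pictured with the following minimal sketch. It assumes, purely for illustration, that each partition exposes one contiguous shared window described by a base address in the partition's real address space, a base address in the common address space, and a size. The names `SharedWindow`, `Gateway`, `outbound`, and `inbound` and all address values are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedWindow:
    """Hypothetical description of one shared area as seen by one partition."""
    local_base: int   # base of the shared area in the partition's real addresses
    common_base: int  # base of the same area in the common address space
    size: int


class Gateway:
    """Hypothetical gateway between a node and the switch."""

    def __init__(self, window):
        self.window = window

    def outbound(self, real_addr):
        # Command leaving the node: partition real address -> common address.
        w = self.window
        if w.local_base <= real_addr < w.local_base + w.size:
            return w.common_base + (real_addr - w.local_base)
        return None  # outside the shared window; not modeled in this sketch

    def inbound(self, common_addr):
        # Command arriving at the node: common address -> partition real address.
        w = self.window
        if w.common_base <= common_addr < w.common_base + w.size:
            return w.local_base + (common_addr - w.common_base)
        return None  # outside the shared window; not modeled in this sketch


# Two partitions place the same shared area at different real addresses.
gw0 = Gateway(SharedWindow(0x4000_0000, 0x1_0000_0000, 0x1000_0000))
gw1 = Gateway(SharedWindow(0x8000_0000, 0x1_0000_0000, 0x1000_0000))

on_switch = gw0.outbound(0x4000_0040)  # address carried by the snoop command
at_peer = gw1.inbound(on_switch)       # address seen inside the other partition
```

Because only the gateway knows the common address, each partition in this sketch can lay out its local area independently, and the CPUs themselves need no additional translation stage.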
Furthermore, in another preferred aspect of the present invention, conformity between the address of the access command issued from another partition and the configuration of the shared area is checked at the gateway of each node. As a result, fault containment can be realized between partitions.
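As a rough sketch of such a conformity check, the gateway can be modeled as holding a table of shared areas and admitting a command from another partition only if its address falls inside a configured area. The per-area set of permitted partitions, and the names `SharedArea` and `admit`, are assumptions added for illustration and are not stated in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedArea:
    """Hypothetical configuration entry held at a node's gateway."""
    base: int            # base of the area in the common address space
    size: int
    sharers: frozenset   # partitions allowed to access it (an assumption)


def admit(areas, addr, source_partition):
    """Return True if a command from another partition conforms to the
    shared area configuration, and False if it must be rejected."""
    for area in areas:
        in_range = area.base <= addr < area.base + area.size
        if in_range and source_partition in area.sharers:
            return True
    return False


config = [SharedArea(base=0x1_0000_0000, size=0x1000, sharers=frozenset({0, 1}))]

ok = admit(config, 0x1_0000_0010, source_partition=1)        # inside, permitted
stray = admit(config, 0x2_0000_0000, source_partition=1)     # outside any area
outsider = admit(config, 0x1_0000_0010, source_partition=2)  # not a sharer
```

Rejecting the non-conforming commands at the gateway keeps an erroneous access issued by one partition from reaching the local area of another.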
Furthermore, in another preferred aspect of the present invention, there is provided means for the system software to dynamically modify the configuration information of the shared area between partitions. As a result, flexible management of the shared area becomes possible.
In addition, in order to achieve the above described third object, each partition is provided with a function of resetting CPUs of other partitions. In the case where a certain partition suffers from system down, it is possible for another partition to reset and re-initialize the partition which has suffered from system down.