Server systems operated as part of a backbone system require high availability and flexible allocation of resources (hardware resources). As one technique for achieving such high availability and flexible allocation of resources, a function known as “multi-domain” or “multi-partition” has been used, in which a single system is divided into multiple domains (partitions) and a separate operating system (OS) is executed on each of the domains.
Another technique, known as a “virtual machine” (VM), has also been used, in which a single system is operated as if it were multiple systems (VMs) with the assistance of software and/or firmware (hardware assistance may sometimes also be needed), and an OS is executed on each VM. In contrast, in a domain system, most of each domain is “physically” independent.
FIGS. 50 and 51 are drawings illustrating an exemplary configuration of a multi-domain system in a server system, wherein FIG. 50 is a diagram illustrating the system prior to establishment of domains, and FIG. 51 is a diagram illustrating the system following establishment of domains.
A server system 100 depicted in FIGS. 50 and 51 includes a common unit 101, CPUs (Central Processing Units) 102-1 and 102-2, memories (MEMs) 103-1 and 103-2, and input/output units (I/Os) 104-1 and 104-2. Multiple domains can be established by combining the CPUs 102-1 and 102-2, the memories 103-1 and 103-2, and the I/Os 104-1 and 104-2 in various ways.
For example, as depicted in FIG. 51, the CPU 102-1, the memory 103-1, and the I/O 104-1 are combined to establish a domain D1, and the CPU 102-2, the memory 103-2, and the I/O 104-2 are combined to establish a domain D2. In addition, in the domain system, the configuration of a previously established domain can be modified, e.g., a CPU in any location in the system may be assigned to the domain, or any number of CPUs may be assigned to a single domain.
Although the example depicted in FIGS. 50 and 51 represents an ideal multi-domain system, in most cases multiple CPUs, or a CPU and a memory, are mounted on a single board, and this mounting arrangement may limit the combinations that are possible.
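The establishment of domains described above can be illustrated with a minimal sketch. The class names `Domain` and `ServerSystem` and the methods `establish` and `assign_cpu` are hypothetical and chosen only for illustration; the sketch assumes the ideal case in which any free component may be assigned to any domain, and it does not model the common unit 101.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A domain (partition): a named group of hardware components."""
    name: str
    cpus: set = field(default_factory=set)
    memories: set = field(default_factory=set)
    ios: set = field(default_factory=set)


class ServerSystem:
    """Hypothetical model of the component pool of the server system 100."""

    def __init__(self, cpus, memories, ios):
        # Components not yet assigned to any domain (the state of FIG. 50).
        self.free = {"cpus": set(cpus), "memories": set(memories), "ios": set(ios)}
        self.domains = {}

    def establish(self, name, cpus, memories, ios):
        """Create a domain from free components (the state of FIG. 51)."""
        if name in self.domains:
            raise ValueError(f"domain {name} already exists")
        request = {"cpus": set(cpus), "memories": set(memories), "ios": set(ios)}
        # Check everything first so a failed request changes nothing.
        for kind, wanted in request.items():
            missing = wanted - self.free[kind]
            if missing:
                raise ValueError(f"{kind} not available: {sorted(missing)}")
        for kind, wanted in request.items():
            self.free[kind] -= wanted
        self.domains[name] = Domain(name, **request)
        return self.domains[name]

    def assign_cpu(self, name, cpu):
        """Modify an established domain by assigning one more free CPU."""
        if cpu not in self.free["cpus"]:
            raise ValueError(f"CPU {cpu} is not available")
        self.free["cpus"].remove(cpu)
        self.domains[name].cpus.add(cpu)


system = ServerSystem(
    cpus=["102-1", "102-2"],
    memories=["103-1", "103-2"],
    ios=["104-1", "104-2"],
)
d1 = system.establish("D1", ["102-1"], ["103-1"], ["104-1"])
d2 = system.establish("D2", ["102-2"], ["103-2"], ["104-2"])
```

In this sketch, a component belongs to at most one domain at a time, which reflects the statement that most of each domain is physically independent.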
Furthermore, in order to permit arbitrary modification of the domain configuration, the server system 100, which allows establishment of domains, includes the common unit 101, such as a cross bar, a clock, or the like, which is shared among the multiple domains.
The server system 100 is designed to keep the common unit 101 as small as possible and/or to provide redundancy and duplication, in order to avoid a situation where all domains go down simultaneously.
Although the system is configured so that a fault, such as a failure, in a portion of the system other than the common unit 101 typically does not affect other systems, a fault in the common unit 101 is highly likely to bring down all of the domains. For example, especially when the system operates at a high frequency, it is difficult to switch the cross bar, the clock, or the like, to the normally running system, or to degenerate the affected system, without bringing down a domain.
FIG. 52 is a diagram illustrating an example in which a fault occurs at a site other than the common unit 101 in a multi-domain system, and FIG. 53 is a diagram illustrating an example in which a fault occurs in the common unit 101 in the multi-domain system.
In a conventional multi-domain system, as depicted in FIG. 52, for example, when a failure occurs in the CPU 102-1 in the domain D1, only the domain D1, which is affected by the failure, is shut down (partial degeneration), while the domain D2 continues to operate.
In contrast, as depicted in FIG. 53, when a failure occurs in the common unit 101 in the multi-domain system, both the domains D1 and D2 go down in many cases. Depending on the fault site, however, a failure in the common unit 101 may affect only a particular domain.
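The two fault cases of FIGS. 52 and 53 can be summarized in a small sketch. The function `affected_domains` and its `common_scope` parameter are hypothetical; `common_scope` is an assumed way of expressing that the fault site within the common unit 101 relates only to particular domains, and `None` stands for a fault that affects every domain.

```python
COMMON_UNIT = "101"


def affected_domains(fault_site, domains, common_scope=None):
    """Return the sorted names of the domains brought down by a fault.

    domains:      dict mapping a domain name to the set of components it owns.
    common_scope: for a fault in the common unit only, the names of the
                  domains that the faulty part serves (assumption); None
                  means the fault affects all domains.
    """
    if fault_site == COMMON_UNIT:
        if common_scope is None:
            # Case of FIG. 53: all domains go down.
            return sorted(domains)
        # The fault site relates only to particular domains.
        return sorted(name for name in domains if name in common_scope)
    # Case of FIG. 52: only the domain owning the failed component goes down.
    return sorted(name for name, parts in domains.items() if fault_site in parts)


domains = {
    "D1": {"102-1", "103-1", "104-1"},
    "D2": {"102-2", "103-2", "104-2"},
}
cpu_fault = affected_domains("102-1", domains)
common_fault = affected_domains(COMMON_UNIT, domains)
partial_common_fault = affected_domains(COMMON_UNIT, domains, common_scope={"D1"})
```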
As described above, when a fault occurs in the common unit 101 and the fault site relates only to a particular domain, conventional multi-domain systems give priority to continuing the operation of the (surviving) domain that is not affected by the failure, and do not carry out degeneration of the common unit 101, since such degeneration may bring down the entire system.
However, some users assign different levels of significance to the domains that are established. In such a case, when a failure in the common unit 101 brings down a domain having a higher significance, recovery of the highly significant domain may be delayed, since a conventional multi-domain system gives priority to continuing the operation of the surviving domain.
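The situation described above can be illustrated with a short sketch. The function `domains_with_delayed_recovery` and the integer significance values are hypothetical, as is the rule that a domain counts as “highly significant” when it outranks every surviving domain; the sketch only identifies the problematic case under the conventional policy and does not represent any particular system.

```python
def domains_with_delayed_recovery(significance, affected):
    """List the down domains whose recovery the conventional policy may delay.

    significance: dict mapping a domain name to a user-assigned integer
                  (larger means more significant; assumed representation).
    affected:     names of the domains brought down by a common-unit fault.

    Under the conventional policy the surviving domains keep running and the
    common unit is not degenerated, so the affected domains must wait. The
    result holds the affected domains that outrank every survivor.
    """
    survivors = [name for name in significance if name not in affected]
    if not survivors:
        # No surviving domain is being protected, so the conflict does not arise.
        return []
    highest_survivor = max(significance[name] for name in survivors)
    return sorted(
        name for name in affected if significance[name] > highest_survivor
    )


significance = {"D1": 10, "D2": 1}
delayed = domains_with_delayed_recovery(significance, affected={"D1"})
not_delayed = domains_with_delayed_recovery(significance, affected={"D2"})
```

Here the domain D1 is more significant than the surviving domain D2, so D1 is reported as a domain whose recovery may be delayed; when the less significant domain D2 is the one that goes down, nothing is reported.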