1. Field of the Invention
The present invention relates to a computer system comprising a plurality of machines connected to a shared memory, and to a control method for such a computer system. More particularly, the present invention relates to such a computer system and control method in which control is improved over that of conventional systems.
Recently, computer systems in which a plurality of machines are connected to each other via a shared memory have come into general use, because the capability of a single processor is improving more slowly and because there is a strong need for improved reliability. There is also a demand for operating a computer system as a plurality of virtual computer systems, by using a shared memory.
Further, there is a demand for a system having a hot-standby capability, wherein it is possible to detect that a system operated under an AVM (an OS for controlling a virtual computer system) has gone down due to an abnormality.
2. Description of the Prior Art
(1) Conventional computer system 1
A description will first be given of a first conventional computer system.
FIG. 1 shows the construction of a first conventional computer system.
In the example of FIG. 1, a machine 10 is operated as a plurality of virtual machines 11-1 through 11-n (hereinafter, referred to as logic machines). The machine 10 has an operating system (hereinafter, referred to as an AVM) 12 for controlling the logic machines 11-1 through 11-n.
(2) Conventional computer system 2
Secondly, a description will be given of a case where a computer system is connected to a shared memory.
FIG. 2 shows the construction of a second conventional computer system.
In the example of FIG. 2, the above-mentioned machine 10 is connected to a shared memory 50. The machine 10 and the shared memory 50 are connected via a real access path 60 provided in the shared memory 50. The machine 10 reads information from and writes information to the shared memory 50.
The machine 10 is provided with the AVM 12 and the plurality of logic machines 11-1 through 11-n. A logic (virtual) access path 71 is disposed between the AVM 12 and each of the logic machines 11-1 through 11-n. The logic machines 11-1 through 11-n read information from and write information to the shared memory 50 via the access path 71 and the AVM 12.
FIG. 3 is a diagram explaining the second conventional computer system.
In the second conventional computer system shown in FIG. 3, the machine 10 is connected to a shared memory 51 via an access path 61, and a machine 20 is connected to the shared memory 51 via an access path 62. A machine 30 is connected to a shared memory 52 via an access path 63, and a machine 40 is connected to the shared memory 52 via an access path 64.
The machine 10 connected to the shared memory 51 is executing a process with respect to the shared memory 51. One of the logic machines in the machine 20 is in a standby state under the control of the AVM, and another in the machine 20 is used in developing a computer system. The machine 30 connected to the shared memory 52 is executing a process with respect to the shared memory 52, and the machine 40 is in a standby state. In this way, exclusive control is imposed when the system shown in FIG. 3 is in a hot-standby mode such that, while one of the machines 10 (30) is executing a process with respect to the shared memory 51 (52), the other machine 20 (40) is in a standby state.
(3) Conventional computer system 3
Thirdly, a description will be given of a case where a plurality of machines are connected to a shared memory.
FIG. 4 shows the construction of a third conventional computer system.
In the computer system shown in FIG. 4, a plurality of machines 10, 20, 30 and 40 are connected to the shared memory 50. The machines 30 and 40 are operated as virtual machines. Logic machines in each of the virtual machines 30 and 40 are provided with a relative machine No. For example, the logic machine 31-1 is provided with a No. 1, the logic machine 31-2 a No. 2, the logic machine 31-3 a No. 3, and the logic machine 31-4 a No. 4. Likewise, the logic machine 41-1 of the virtual machine 40 is provided with a No. 1, the logic machine 41-2 a No. 2, the logic machine 41-3 a No. 3, and the logic machine 41-4 a No. 4. Further, the machine 10 is provided with a real machine No. 0, the machine 20 a real machine No. 1, the machine 30 a real machine No. 2 and the machine 40 a real machine No. 3.
A description will be given of a case where an operator 80 specifies the logic machine 31-1 of the machine 30. When the operator 80 specifies the real machine No. 2 of the machine 30, it means that an AVM 32 of the machine 30 having the real machine No. 2 is specified. According to a predetermined sequence, the AVM 32 specifies a relative machine No. 1, for example, indicating the logic machine 31-1 of the machine 30. In a computer system comprising a plurality of machines connected to each other via a shared memory, a virtual machine operated under the AVM allows only one logic machine under its control to be connected to another computer. Since the real machine No. and the logic machine are in one-to-one correspondence at a given moment, it is possible to specify a logic machine by specifying a real machine No. When the operator 80 specifies the real machine No. 2, for example, it means that the logic machine 31-1 is specified.
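The addressing described above can be illustrated with a minimal sketch. The class `Avm`, the function `specify`, and the field `connected_no` are hypothetical names introduced only for illustration, and only a subset of the logic machines of FIG. 4 is listed. The sketch assumes, as stated above, that each AVM has exactly one logic machine connected at a given moment.

```python
# Hypothetical model of the FIG. 4 addressing; all names are illustrative.
from dataclasses import dataclass


@dataclass
class Avm:
    """Control program of a machine operated as a virtual machine."""
    logic_machines: dict   # relative machine No. -> logic machine label
    connected_no: int      # the single relative No. connected right now

    def resolve(self) -> str:
        # Only one logic machine is connected at a given moment, so the
        # AVM needs no further information to pick the target.
        return self.logic_machines[self.connected_no]


# real machine No. -> AVM (subset of FIG. 4; connected_no is an assumption)
avms = {
    2: Avm({1: "31-1", 2: "31-2", 3: "31-3"}, connected_no=1),
    3: Avm({1: "41-1", 2: "41-2"}, connected_no=1),
}


def specify(real_machine_no: int) -> str:
    """Return the logic machine addressed by a real machine No."""
    return avms[real_machine_no].resolve()


print(specify(2))  # 31-1
```

The lookup succeeds only because of the one-to-one correspondence; nothing in the real machine No. itself identifies a logic machine.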
(4) Communication method in a conventional computer system
Fourthly, a description will now be given of a communication undertaken between the machines in a conventional computer system.
FIG. 5 is a diagram explaining the communication scheme of the third conventional computer system. As shown in FIG. 5, the plurality of machines 10, 20 and 40 and the like share the shared memory 50. Communication between the machines via the shared memory 50 is executed such that an originating machine specifies a real machine No. of a destination machine. For example, assuming that the machine 10 has a real machine No. 0 and the machine 20 has a real machine No. 1, the machine 10 requests communication with the machine 20 by specifying the real machine No. 1. The machine 40 is provided with a plurality of logic machines 41-1 through 41-n. It is possible for the logic machine 41-3 of the machine 40 to communicate with the machine 20 via an AVM 42 and the shared memory 50, by specifying the real machine No. 1 of the machine 20. In this way, communication among the machines 10, 20 . . . via the shared memory 50 is possible by specifying the real machine No.
(5) Interruption handling in conventional communication
A description will now be given of interruption handling effected in conventional communication.
In the above-described system in which a plurality of machines share a shared memory, communication between virtual machines is possible using a GSIGP instruction. In order to keep track of whether an interruption is pending or reflected (processed), a communication process based on the pending status of the interruption is conducted as shown in FIG. 6. GSIGP instructions have the function of allowing communication between machines and controlling remote machines. The function of controlling remote machines is used when a downed machine is to be controlled. In this case, a GSIGP instruction is used to halt the operation of a CPU, reset the I/O, and begin a memory dump.
FIG. 6 is a sequence chart explaining interruption handling in conventional communication.
It is assumed that the machine A communicates with the machine B.
step 1) The machine A issues a GSIGP instruction for requesting communication to the machine B via a shared memory.
step 2) The machine B puts the interruption in a pending state by hardware means because the interruption is not reflected.
step 3) Upon an occurrence of a next communication request, the machine A issues a GSIGP instruction via the shared memory.
step 4) Upon a determination that the interruption is pending in the machine B, the shared memory receives the GSIGP instruction from the machine A, assuming that the interruption will be reflected by the machine B later, and queues the communication request from the machine A.
step 5) When the machine B is ready to reflect the interruption, the hardware of the machine B cancels the pending state of the interruption and causes the interruption to be reflected.
step 6) The machine B processes the pending communication request and the communication request queued in the shared memory.
Communication requests may be queued in the shared memory so that a plurality of communication requests may be processed in the event of an interruption.
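Steps 1 through 6 can be traced with a minimal sketch. The classes `SharedMemory` and `Receiver` and their methods are hypothetical names; the pending state, which the text attributes to the hardware of the machine B, is modeled here simply as a field.

```python
# Hypothetical model of the FIG. 6 sequence; all names are illustrative.
from collections import deque


class SharedMemory:
    def __init__(self):
        self.queued = deque()  # requests held while an interruption is pending


class Receiver:
    """Plays the role of the machine B."""

    def __init__(self, shared):
        self.shared = shared
        self.ready = False     # whether B can reflect an interruption
        self.pending = None    # interruption held in the pending state
        self.processed = []

    def gsigp_request(self, request):
        if self.ready:
            self.processed.append(request)
        elif self.pending is None:
            self.pending = request              # step 2: put in pending state
        else:
            self.shared.queued.append(request)  # step 4: queue in shared memory

    def become_ready(self):
        self.ready = True
        if self.pending is not None:            # step 5: cancel pending state
            self.processed.append(self.pending)
            self.pending = None
        while self.shared.queued:               # step 6: drain queued requests
            self.processed.append(self.shared.queued.popleft())


shared = SharedMemory()
machine_b = Receiver(shared)
machine_b.gsigp_request("req-1")   # steps 1-2
machine_b.gsigp_request("req-2")   # steps 3-4
machine_b.become_ready()           # steps 5-6
print(machine_b.processed)         # ['req-1', 'req-2']
```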
(6) Conventional system control
In a conventional complex system in which machines are connected to each other via a shared memory, system control in which a downed machine is reset by another machine, using a GSIGP instruction (reset), is enabled. When such a resetting is completed, system switching according to the hot-standby scheme is conducted.
FIG. 7 is a sequence chart explaining the resetting process in conventional system control.
In the description that follows, it is assumed that the machine A controls resetting of the machine B.
step 10) The machine A issues a GSIGP instruction (reset) in order to reset the machine B.
step 11) The machine B begins its resetting by hardware means upon receipt of the GSIGP instruction via the shared memory.
step 12) The machine A issues a GSIGP instruction (sense) to determine whether the resetting is completed or the machine B is in the process of its resetting.
step 13) The machine A recognizes that the machine B is in the process of its resetting based on the result yielded in response to the GSIGP (sense) instruction.
step 14) The machine B completes its resetting by hardware.
step 15) The machine A issues a GSIGP instruction (sense).
step 16) The machine A recognizes that the machine B has completed its resetting based on the result yielded in response to the GSIGP instruction (sense).
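Steps 10 through 16 amount to issuing a reset and then polling with sense until completion is reported. The following is a minimal sketch of that pattern; the class `MachineB`, the function `reset_and_wait`, the tick-based timing, and the status strings are all assumptions made for illustration.

```python
# Hypothetical model of the FIG. 7 reset/sense sequence; names are illustrative.
class MachineB:
    def __init__(self, reset_ticks):
        self.reset_ticks = reset_ticks  # assumed duration of a hardware reset
        self.remaining = 0

    def gsigp_reset(self):
        self.remaining = self.reset_ticks       # step 11: resetting begins

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1                 # hardware makes progress

    def gsigp_sense(self):
        return "resetting" if self.remaining > 0 else "reset complete"


def reset_and_wait(target, max_polls):
    """Machine A's side: reset, then poll with sense until completion."""
    target.gsigp_reset()                        # step 10
    for polls in range(1, max_polls + 1):
        if target.gsigp_sense() == "reset complete":   # steps 12-13, 15-16
            return polls
        target.tick()
    raise TimeoutError("reset not completed within max_polls")


print(reset_and_wait(MachineB(reset_ticks=2), max_polls=10))  # 3
```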
(7) Conventional process executed in the event of a system down
Conventionally, when an OS detects a system down, the OS controls the downed machine by a GSIGP instruction (halt CPU) or a GSIGP instruction (reset). A GSIGP instruction is honored and executed by a service processor (SVP) provided in each machine. The service processor assumes that GSIGP instructions (pending) may continue to be issued endlessly and effects forced resetting when a predetermined period of time has expired.
In case a logic machine is deactivated, an AVM recognizes the deactivation of the logic machine and removes the logic machine from the service by disconnecting a logic path between the logic machine and the AVM.
However, the aforementioned aspects (1)-(7) of the conventional system have the following problems.
(1) In the systems shown in FIGS. 1 through 3, a machine either stands alone or is connected directly to a shared memory. While it is possible for a machine in the latter configuration to exchange information with other machines, there is a problem in that only one machine under the control of an AVM is allowed to access the shared memory.
(2) When initialization by a plurality of machines is executed using an IPL, there is a likelihood that initialization is executed by the plurality of machines simultaneously, with the result that data compatibility may suffer. When there is a hang-up in the OS that mediated the IPL operation so that the OS is restarted, an erroneous operation of the system may result.
(3) In the system shown in FIG. 4, the AVM is capable of specifying a logic machine in correspondence to the specification by the operator and in accordance with a predetermined order. However, in the system shown in FIG. 8, wherein no predetermined specification order is determined for a plurality of logic machines in a virtual machine, specification of a logic machine only by means of a real machine No. is impossible. The same thing is true of a system in which logic machines in a virtual machine are operated concurrently. Referring to FIG. 8, when the operator 80 specifies a real machine No. 2 of the machine 30, the control is turned over to the AVM 32 of the machine associated with the real machine No. 2. However, the AVM 32 cannot determine which logic machine included in the machine 30 is to be specified. Therefore, it is impossible to process communication involving a specific logic machine.
(4) In a conventional scheme, only one logic machine out of a plurality of logic machines in a virtual machine connected to the shared memory can use the shared memory. That is, in a complex system where a plurality of machines operated as virtual machines each comprising a plurality of logic machines are connected to each other via a shared memory, it is impossible to specify a logic machine controlled in each virtual machine by real machine Nos.
(5) When a communication request is issued from a real machine to a virtual machine in a conventional configuration where real machines and virtual machines share a shared memory, an associated interruption is reflected by the originating machine before the communication request is reflected by the destination virtual machine. Therefore, the interruption does not become pending in the hardware of the destination machine. Accordingly, the other machines in this system recognize that this interruption is reflected. When the destination virtual machine is not ready to reflect an interruption, it is impossible for a real machine to keep track of the state of the virtual machine with regard to interruptions. Specifically, even if the virtual machine is in a pending state, requests to the virtual machine arrive one after another at the central program of the originating machine, causing the central program to discard the interruption. As a result, communication having as its destination a virtual machine cannot be executed properly.
When a communication request is queued in the shared memory due to the pending state of a first virtual machine and then an interruption requesting a second virtual machine occurs subsequently, the first interruption is not reflected readily because the destinations are different.
(6) When a GSIGP instruction (for CPU) is issued in a conventional configuration system where a plurality of virtual machines share a shared memory, as is done between machines operated as real machines, the CPU of the machine operated as a virtual machine comes to a halt. Even if the "interruption control method for communication between computer systems" disclosed in Japanese Laid-Open Patent Application No. 5-324362 is applied to a system in which a plurality of virtual machines share a shared memory, the machine that originated a GSIGP instruction (reset) may be unable to recognize a reset completion properly. Specifically, while a logic machine (for example, a logic machine a) in a virtual machine is being reset in response to a GSIGP instruction (reset) issued by a machine (for example a machine A), a request for resetting or communicating with a logic machine b operated under a same control program as the logic machine a, which request is issued by another machine (for example, a machine B), cannot be processed. In order to resolve this situation, the control program may instruct the hardware to cancel the "reset proceeding" state when the control program initiates resetting of the AVM. However, this has a problem in that the machine that originated the GSIGP instruction (reset) is unable to recognize completion (cancellation of the "reset proceeding" state) of the reset properly.
FIG. 9 is another diagram explaining a problem with the prior art. Encircled numbers in FIG. 9 correspond to the numbers in the parentheses at the beginning of each description below.
(1) A machine A issues a GSIGP instruction (reset of logic machine a) for resetting a logic machine a in a machine V operated as a virtual machine.
(2) The reset request from the machine A is put in a pending state in the machine V by the hardware thereof.
(3) When the AVM of the machine V recognizes the reset request, the AVM executes a resetting process of the logic machine a.
(4) A machine B issues a GSIGP instruction (reset of logic machine b) for resetting a logic machine b in the machine V.
(5) Since the hardware of the machine V is executing the resetting process, the reset request from the machine B is not honored.
(6) When the resetting of the logic machine a is completed, the AVM instructs the hardware to cancel the pending state of the reset request.
(7) The machine A recognizes completion of the resetting of the logic machine a of the machine V.
In other words, while a resetting process is being executed in a virtual machine in response to a reset request issued by a first machine, it is impossible for a second machine to request resetting of or communication with the virtual machine operated under the AVM executing the resetting process.
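The blocking behavior of FIG. 9 can be sketched as follows. The class `MachineV` and its methods are hypothetical; the sketch assumes the hardware holds a single reset-pending state per machine, regardless of which logic machine the reset targets.

```python
# Hypothetical model of the FIG. 9 problem; all names are illustrative.
class MachineV:
    """A machine operated as a virtual machine, with one hardware pending slot."""

    def __init__(self):
        self.reset_pending_for = None  # (originating machine, logic machine)

    def gsigp_reset(self, origin, logic_machine):
        if self.reset_pending_for is not None:
            return False               # (5) request is not honored
        self.reset_pending_for = (origin, logic_machine)   # (2) pending
        return True

    def avm_complete_reset(self):
        # (6) the AVM instructs the hardware to cancel the pending state
        finished = self.reset_pending_for
        self.reset_pending_for = None
        return finished


v = MachineV()
a_accepted = v.gsigp_reset("A", "a")   # (1)-(3)
b_accepted = v.gsigp_reset("B", "b")   # (4)-(5): a different logic machine
print(a_accepted, b_accepted)          # True False
v.avm_complete_reset()                 # (6)-(7)
```

The request from the machine B is refused even though it targets a different logic machine, because the pending state belongs to the machine V as a whole.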
FIG. 10 is yet another diagram explaining a problem with the prior art. Encircled numbers in FIG. 10 correspond to the numbers in the parentheses at the beginning of each description below.
(1) A machine A issues a GSIGP instruction (reset) for resetting a logic machine a in a machine V operated as a virtual machine.
(2) The hardware of the machine V holds the reset request according to the method disclosed in Japanese Laid-Open Patent Application No. 5-324362 ("interruption control method for communication between computer systems").
(3) The AVM of the machine V recognizes the reset request and instructs the hardware to cancel the pending state of the reset.
(4) The AVM executes the resetting process of the logic machine a.
(5) The machine A recognizes completion of the resetting process of the logic machine a of the machine V.
Thus, there is a problem in that the machine that originated the GSIGP instruction (reset) erroneously recognizes completion (cancellation of the "reset proceeding" state) of the resetting process if the AVM of the machine V instructs the hardware to cancel the resetting process.
Further, if a logic machine in a machine operated under the AVM is down in a system configuration in which a plurality of machines are connected via a shared memory, a service processor provided in each machine is unable to perform a forced reset normally carried out when a resetting process is not completed within a predetermined period of time. This is because the service processor activates a forced reset when it recognizes a pending state. Cancellation of the "reset proceeding" state by the AVM causes the monitoring of the service processor using a timer to stop. Since the service processor is unable to recognize a pending state, it is unable to perform a forced reset.
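The erroneous recognition in FIG. 10 can be sketched as below. The class and method names are hypothetical, and the sketch assumes the originating machine learns the reset status through a GSIGP instruction (sense) that reports only the hardware state.

```python
# Hypothetical model of the FIG. 10 problem; all names are illustrative.
class MachineV:
    def __init__(self):
        self.hw_reset_pending = False   # "reset proceeding" state in hardware
        self.logic_reset_done = False   # actual state of the logic machine a

    def gsigp_reset(self):
        self.hw_reset_pending = True    # (2) hardware holds the request
        self.logic_reset_done = False

    def avm_accept(self):
        self.hw_reset_pending = False   # (3) AVM cancels the pending state

    def avm_finish(self):
        self.logic_reset_done = True    # (4) resetting actually completes

    def gsigp_sense(self):
        # Assumed: sense sees only the hardware state, not the AVM's progress.
        return "reset proceeding" if self.hw_reset_pending else "reset complete"


v = MachineV()
v.gsigp_reset()                        # (1)
v.avm_accept()                         # (3)
seen_by_a = v.gsigp_sense()            # (5) the machine A checks the state
print(seen_by_a, v.logic_reset_done)   # reset complete False
```

The machine A sees "reset complete" while the logic machine a has not yet been reset, which is the mismatch described above.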
Further, when a logic machine operated under the AVM is deactivated, a logic path between the AVM and the logic machine is disconnected so that the OS cannot control the logic machines. For example, the OS cannot reset a logic machine. For this reason, there is a problem in that an operator has to reset a logic machine whenever the logic machine is deactivated.
Accordingly, an object of the present invention is to build a large-scale computer system in which loads must be distributed across a plurality of machines, by connecting a plurality of computer systems to a shared memory that allows communication between a plurality of machines operated as a virtual machine and having a plurality of logic machines.
Another and more specific object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein, when an initialization of the shared memory is started, an updating request from another machine is subjected to exclusive control, and, when one of the machines is put out of service, system failure due to, for example, a restart is prevented.
Still another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein the AVM of a machine operated as a virtual machine assigns machine Nos. to logic machines under its control so that both the AVM and the control system of the logic machine are capable of recognizing the computer Nos.
Still another object of the present invention is to allow flexible communication in a system in which a plurality of machines are connected via a shared memory.
Yet another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein an interruption to the AVM is properly reflected.
Another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein completion of a resetting process in response to a GSIGP instruction (reset) is properly recognized by a machine that originated the instruction.
Still another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein it is possible to detect an abnormality of the AVM and notify other machines of the abnormality, and it is possible for a machine to control a real machine that went down or the logic machines in the machine operated under the AVM in which the abnormality is found.
Yet another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein it is possible for a machine that went down to notify other machines connected to the shared memory of a state of the system down.
Another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein, when there is a machine in which a system down occurs, it is possible for the OS of another machine to recognize the system down.
Still another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein a machine that recognizes a downed machine is capable of controlling the downed machine.
Yet another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein it is possible to halt the operation of the CPU of a downed machine and reset the I/O thereof so that a hot-standby state is introduced.
Another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein control of a downed machine is effected such that, when a virtual machine goes down, it is possible to force a reset thereof using the hardware, similarly to the case of a real machine.
Still another object of the present invention is to provide a computer system comprising a plurality of machines connected to a shared memory, wherein an operator interruption message once displayed is extinguished immediately when the operator has completed the interruption.
In order to achieve the aforementioned objects, the present invention provides a computer system comprising a plurality of machines connected to a shared memory, the computer system including at least one real machine and/or a plurality of virtual machines, wherein each of the virtual machines is provided with an AVM for controlling a virtual machine, and the real machine is provided with an OS for controlling the real machine and individual logic machines in the virtual machine.
In accordance with one aspect of the present invention, it is possible to connect a plurality of machines operated as virtual machines to a shared memory, wherein communication between a virtual machine operated in a machine and a virtual machine operated in a separate machine is possible.
In accordance with another aspect of the present invention, it is possible for a virtual machine connected to a shared memory to lock a shared memory in a hot-standby mode so that an access from another machine is subject to exclusive control. When an abnormal halt of the operation of the locking machine is detected, it is possible for a detecting machine to initialize the failed machine so that an access path between the failed machine and the shared memory is disconnected. Thereby, disruption of data is prevented.
In accordance with still another aspect of the present invention, when an initialization process of a logic machine in a virtual machine is halted due to an abnormality, a logic path between the failed logic machine and the OS (AVM) that controls the failed logic machine is disconnected. Since an access path for the virtual machine is not disconnected, it is possible for the other logic machines under the same AVM to access the shared memory.
In accordance with yet another aspect of the present invention, an identifier is assigned to each of a plurality of logic machines operated in a virtual machine connected to a shared memory. Upon being started, the OS of a logic machine queries, as required, the AVM that controls the logic machines for its own identifier. Therefore, it is possible to identify a call originating logic machine and/or a call receiving logic machine in communication involving a logic machine.
In accordance with another aspect of the present invention, a determination is made in communication as to whether the destination machine is a real machine or a machine operated as a virtual machine. Information necessary for communication between a plurality of machines is exchanged so as to keep track of the operating state of the machines. Therefore, it is possible to effect communication wherein the state and identity of the communication destination are properly recognized, even in a complex system where real machines and virtual machines are connected via a shared memory.
In accordance with another aspect of the present invention, when a virtual machine receives a request for communication from another machine, a determination is made by the virtual machine as to whether the request has as its destination the virtual machine or another machine. When the request has as its destination the virtual machine, the virtual machine queues the communication request under the control of the OS until the requested logic machine is ready for a communication process. When the logic machine is ready for the process, the communication request is reflected.
In accordance with still another aspect of the present invention, communication requests for a plurality of logic machines are queued. When a logic machine is ready for the communication, the logic machine is notified of the request. The OS of the notified logic machine determines whether or not there are other communication requests queued. When there are, those communication requests are also processed. In this way, communication requests are properly reflected even when there is a new interruption, or different communication interruption states exist between the requested logic machine and the machine which controls the same logic machine.
In accordance with yet another aspect of the present invention, when there is an overflow of a queue, the OS controlling the requested logic machine notifies the machine that originated the communication request of the overflow of the queue so that a new communication request is prevented from being issued.
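The queuing and overflow aspects described above can be illustrated with a minimal sketch. The class `AvmQueue`, its methods, the fixed capacity, and the return strings are assumptions introduced for illustration; the text does not specify the queue structure or how the overflow notification is delivered.

```python
# Hypothetical sketch of request queuing with overflow notification.
from collections import deque


class AvmQueue:
    """Queue held by the control program for the logic machines under it."""

    def __init__(self, capacity):
        self.capacity = capacity          # assumed fixed queue size
        self.queue = deque()
        self.overflow_notified = set()    # originators told to stop sending

    def receive(self, origin, dest_logic_machine, payload):
        if len(self.queue) >= self.capacity:
            # Notify the originator so that no new request is issued.
            self.overflow_notified.add(origin)
            return "overflow"
        self.queue.append((origin, dest_logic_machine, payload))
        return "queued"

    def logic_machine_ready(self, logic_machine):
        """Deliver every queued request addressed to the ready logic machine."""
        delivered = [r for r in self.queue if r[1] == logic_machine]
        self.queue = deque(r for r in self.queue if r[1] != logic_machine)
        return delivered


q = AvmQueue(capacity=2)
print(q.receive("A", "a", "m1"))    # queued
print(q.receive("B", "b", "m2"))    # queued
print(q.receive("A", "a", "m3"))    # overflow
print(q.logic_machine_ready("a"))   # [('A', 'a', 'm1')]
```

Requests for other logic machines stay queued when one logic machine becomes ready, so requests with different destinations do not block one another.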
In accordance with another aspect of the present invention, when a virtual machine receives a reset request from another machine, the OS controlling the logic machines determines the logic machine that is the target of control. When the control of the logic machine is completed, the OS notifies the machine that originated the reset request of the completion of the control. Therefore, the machine that issued the reset request is able to switch to a hot-standby mode.
In accordance with still another aspect of the present invention, it is possible for the OS controlling a logic machine to receive a communication request from another machine while the OS is executing the resetting process.
In accordance with yet another aspect of the present invention, when the AVM is down due to an unrecoverable error, the failed virtual machine itself is able to notify the other machines connected to a shared memory of the system down via the shared memory. Thereupon, it is possible for one of the other machines connected to the shared memory to control the downed machine; i.e., reset the input and output of the downed machine or halt the operation of the CPU of the downed machine. Rather than allowing a communication session between the downed machine and the machine that received the system-down notification to continue after the system down, which invites errors, this aspect of the invention ensures that the downed machine is logically disconnected from the system.
In accordance with another aspect of the present invention, the state of the downed machine can be properly recognized. Specifically, the state of the downed machine in which a failure occurred during a communication session can be properly recognized. Therefore, it is possible to control the downed machine; i.e. reset the input/output of the downed machine or halt the operation of the CPU of the downed machine. Thus, it is ensured that the downed machine is disconnected from the system.
In accordance with still another aspect of the present invention, the OS waits for a notification of completion of a control from the AVM according to a timer monitoring scheme. When there is no completion notification or failure notification from the AVM within a predetermined period of time, it is determined that a logic machine in the machine operated under the AVM is down.
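The timer monitoring described above can be sketched with a blocking wait that has a timeout. The function name `await_control_result`, the notification strings, and the use of an in-process queue in place of the shared memory are assumptions made for illustration.

```python
# Hypothetical sketch of timer monitoring of a completion notification.
import queue


def await_control_result(notifications, timeout_s):
    """Wait for a completion or failure notification from the AVM.

    If nothing arrives within timeout_s seconds, conclude that a logic
    machine in the machine operated under the AVM is down.
    """
    try:
        return notifications.get(timeout=timeout_s)
    except queue.Empty:
        return "logic machine down"


from_avm = queue.Queue()
from_avm.put("completed")
print(await_control_result(from_avm, timeout_s=0.01))       # completed
print(await_control_result(queue.Queue(), timeout_s=0.01))  # logic machine down
```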
In accordance with yet another aspect of the present invention, when a logic machine in a virtual machine is down, the logic machines other than the downed logic machine are also regarded as being down. Thus, all the logic machines in the virtual machine are disconnected from the system. In this way, a high-speed stand-by process becomes possible by effecting control with respect to the hardware of the virtual machine.
In accordance with another aspect of the present invention, halt of the CPU operation and the reset of the input/output are allowed in a virtual machine. While the I/O reset control according to the prior art is possible only in a real machine not operated under the AVM, the present invention allows such a control in the virtual machine as well.
In accordance with still another aspect of the present invention, when the OS detects that the AVM is down and there is any machine (which may be a virtual machine) other than the logic machines operated under the AVM, that other machine takes control of the downed machine. Alternatively, if there are no machines other than the logic machines operated under the downed AVM, one of the logic machines under the control of the AVM takes control of the downed machine.
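The selection rule described above can be sketched as a small function. The function `choose_controller` and the tuple layout are hypothetical, and picking the first candidate is an arbitrary choice made for illustration; the text does not say how one machine is chosen among several candidates.

```python
# Hypothetical sketch of choosing which machine controls a downed machine.
def choose_controller(machines, downed_avm):
    """machines: list of (name, avm) pairs; avm is None for a real machine.

    Prefer a machine outside the downed AVM; otherwise fall back to a
    logic machine operated under the downed AVM itself.
    """
    outside = [name for name, avm in machines if avm != downed_avm]
    if outside:
        return outside[0]       # arbitrary pick among the candidates
    inside = [name for name, avm in machines if avm == downed_avm]
    return inside[0] if inside else None


mixed = [("41-1", "AVM 42"), ("41-2", "AVM 42"), ("machine 10", None)]
only_logic = [("41-1", "AVM 42"), ("41-2", "AVM 42")]
print(choose_controller(mixed, "AVM 42"))       # machine 10
print(choose_controller(only_logic, "AVM 42"))  # 41-1
```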
In accordance with yet another aspect of the present invention, when a logic machine in a virtual machine is deactivated, the virtual machine notifies another machine of the deactivation. However, the AVM for the deactivated logic machine is responsible for controlling the downed machine or resetting the deactivated logic machine. Therefore, control by means of GSIGP instructions is not necessary. In this way, it is possible to control the deactivated machine; i.e., reset the I/O thereof or halt the operation of the CPU thereof.
In accordance with another aspect of the present invention, it is possible for a machine to receive a notification of a deactivation of another machine.
In accordance with still another aspect of the present invention, an operator interruption message displayed when the control of the downed machine fails is extinguished so as to reduce the degree of operator interruption. Accordingly, a hot-standby mode is attained without an interruption by an operator.