The present invention relates to an operation control method for computer systems of the server class, and relates in particular to a server system operation control method to implement high-speed processing typified by failover processing during system problems and cloning processing during high loads to enhance operability and reliability within the same system.
Businesses operating on the Internet may directly lose business opportunities when their system is down and cannot be accessed, or when response times degrade because of a sudden increase in accesses to a server system. Methods typified by failover and cloning, which improve operability, have already been proposed as techniques to shorten these time losses as much as possible.
The term "failover" refers to a method of switching from the present main system to a standby system and having the standby system take over the processing when a problem has occurred in the processing of the main system. The term "cloning" refers to a method used when the main system is subjected to heavy loads: when processing has backed up (been delayed) in the main system, a standby system shares a portion of the processing load.
Specific examples of these methods are described in the "Sun™ Enterprise™ Cluster Failover" white paper issued by Sun Microsystems Inc.
The structure of the server system based on the technology of the related art is shown in FIG. 2.
In this figure, the reference numeral 202 denotes the main server system in charge of the normal processing in this system. Reference numeral 203 denotes the standby server system that takes over the processing when an error has occurred in the main server system 202.
Reference numeral 204 is a shared disk which is shared by the main server system 202 and the standby server system 203. Reference numeral 205 is a network, such as a LAN or the Internet.
Reference numeral 201 is a client terminal that accesses the server system by way of the network 205 and requests processing.
As shown in this drawing, functions such as failover and cloning are implemented in the related art on the assumption that, in a cluster type system, information is shared between the main server system 202 and the standby server system 203 by way of the shared disk 204.
The take-over processing in the server system shown in FIG. 2 is now described while referring to FIG. 3.
FIG. 3 shows the process flow over time, from top to bottom, of the interactions among the client terminal 201, the main server system 202, the standby server system 203 and the shared disk 204, which are the main elements in this processing.
First, as shown by the processing request and normal response 301, during normal operation the main server system 202 performs processing in response to a processing request from the client terminal 201 and sends the results back to the client terminal 201 as the response.
This processing is repeated as processing requests are generated from the client terminal 201.
The main server status save processing 302 is also conducted during normal operation.
In preparation for the case where the main server system 202 becomes unable to respond to any inquiries due to problems with the hardware, the OS (operating system) or software, so that the status information in its main memory can no longer be read, the main server system 202 writes its own required status information to the shared disk 204 at specified times.
This process could in principle be performed every time an event that changes the status occurs. However, the overhead of accessing the disk is generally high and would degrade the processing capability of the main server system 202, so this approach is not practical.
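The trade-off described above can be illustrated with a minimal Python sketch. It is not taken from the related art; the class name `CheckpointingServer`, the `interval` parameter and the dictionary standing in for the shared disk 204 are all assumptions made for illustration. The sketch counts how many disk writes a given save interval costs and how many status changes would be lost if the main server stopped at the end of the run.

```python
class CheckpointingServer:
    """Toy main server that saves its status to a stand-in shared disk
    once every `interval` events (hypothetical model, not the patent's)."""

    def __init__(self, interval):
        self.interval = interval
        self.state = 0                    # status held in main memory
        self.shared_disk = {"state": 0}   # stand-in for shared disk 204
        self.disk_writes = 0

    def handle_event(self):
        self.state += 1                   # every event changes the status
        if self.state % self.interval == 0:
            self.shared_disk["state"] = self.state
            self.disk_writes += 1


def run(interval, events):
    """Return (disk writes performed, status changes not yet on disk)."""
    server = CheckpointingServer(interval)
    for _ in range(events):
        server.handle_event()
    lost = server.state - server.shared_disk["state"]
    return server.disk_writes, lost
```

Under these assumptions, saving on every event (`interval=1`) loses nothing but pays one disk write per event, while a longer interval reduces the writes at the cost of status that cannot be restored.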
The main system operation check request (hereafter "main system operation check") and the correct response 303 are operation monitoring processes of the main server system 202 run by the standby server system 203. These processes also run during normal operation.
The communication to check the operating status of the main server system 202 is performed at intervals specified by the standby server system 203. The main server system 202 responds to this communication with a reply that there are no errors, which confirms that the main server system 202 is operating correctly.
In the figure, 304 indicates a point where a problem has occurred in this main server system 202.
Operation 305 is the first operation status check of the main server system 202 made by the standby server system 203 after the problem occurs; through it, the standby server system 203 detects the error.
The error response here covers both the case where there is no response at all and the case where the response is delayed due to the error.
Operation 306, on the other hand, shows a processing request issued from the client terminal 201 after the problem occurs and before the standby server system 203 has taken over the processing of the main server system 202.
Here, an error response indicates that a response is not returned within a specified time.
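The monitoring described above, in which a missing or late reply is treated as an error response, can be sketched in Python as follows. The function names, the `probe` callables and the timeout value are hypothetical; each probe simply reports a simulated reply latency in seconds, or `None` when no reply arrives, so that the sketch runs without a real network.

```python
from typing import Callable, Iterable, Optional


def check_main_system(probe: Callable[[], Optional[float]],
                      timeout_s: float) -> bool:
    """Return True when the main system replied within the specified time.

    `probe` returns the reply latency in seconds, or None for no reply.
    """
    latency = probe()
    return latency is not None and latency <= timeout_s


def monitor(probes: Iterable[Callable[[], Optional[float]]],
            timeout_s: float) -> Optional[int]:
    """Run the periodic checks in order.

    Returns the index of the first check that yields an error response,
    or None when every check receives a correct response.
    """
    for index, probe in enumerate(probes):
        if not check_main_system(probe, timeout_s):
            return index   # the point at which take-over would begin
    return None
```

In this sketch both a silent main system and a delayed one are detected in the same way, which matches the definition of an error response as a reply not returned within a specified time.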
The standby server system 203, having detected a problem in the main server system 202 in the operation 305, commences the take-over processing as shown in operation 307. In that process, in order to restore the processing of the main server system 202, the status information stored in operation 302 by the main server system 202 on the shared disk 204, is loaded from the shared disk 204 in operation 308.
The standby server system 203 restores the processing status of the main server system 202 by using this status information. After preparing to take over the processing from the main server system 202, it completes the take-over processing in operation 309.
The standby server system 203 then starts processing as the main server system, as shown by operation 310. It responds to requests from the client terminal 201 to reprocess what failed in operation 306, as well as to other processing requests.
These methods of the related art have the following problems and are unable to meet user needs for high operability.
(1) Restoring the processing of the main server system 202 on the standby server system 203 requires time for accessing the shared disk 204 and for performing the restoration processing.
(2) The most recent information held in the main memory of the main server system 202 when the system problem occurred may not have been written to the shared disk 204, or may be impossible to load, so there are limits on how recent a status can be restored.
The present invention therefore has the object of resolving the above-described problems and providing a system of high operability by shortening periods of inaccessibility and response times through failover, cloning and the like.
A server system operation control method of the present invention, using a single shared main memory multiprocessor system made up of plural processors, a main memory device, an external memory device and a connection means for mutually connecting these components, is characterized in that,
at least two logical units are defined, with each unit made up of any number of processors and a portion of the main memory device; one logical unit is defined as a main logical unit and the other is defined as a standby logical unit; a memory segment is provided on the main memory device to be accessible from both the main logical unit and the standby logical unit, and an information storage space is provided on the memory segment to store information for take-over of control from the main logical unit to the standby logical unit; and
the main logical unit stores information required for take-over of control to the information storage space as the information is made, and
the standby logical unit searches information stored in the information storage space when a take-over request is sent to the standby logical unit from the main logical unit and forms a processing environment and state identical to the main system, and then takes over all or a portion of the processing of the main logical unit.
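The arrangement characterized above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions and not the claimed implementation: the classes `SharedSegment`, `MainUnit` and `StandbyUnit`, the per-client session dictionary and the request strings are all hypothetical, and an ordinary in-process object stands in for the memory segment on the main memory device.

```python
class SharedSegment:
    """Stand-in for the memory segment accessible from both logical units."""

    def __init__(self):
        self.info_storage = {}   # information storage space for take-over


class MainUnit:
    """Hypothetical main logical unit."""

    def __init__(self, segment):
        self.segment = segment
        self.sessions = {}       # processing state in the unit's own memory

    def handle(self, client, request):
        self.sessions[client] = request
        # Store the take-over information as soon as it is made,
        # rather than at a later save point.
        self.segment.info_storage[client] = request
        return f"main:{request}"


class StandbyUnit:
    """Hypothetical standby logical unit."""

    def __init__(self, segment):
        self.segment = segment
        self.sessions = {}
        self.active = False

    def take_over(self):
        # On a take-over request, read the information storage space and
        # rebuild the same processing state as the main unit.
        self.sessions = dict(self.segment.info_storage)
        self.active = True

    def handle(self, client, request):
        if not self.active:
            raise RuntimeError("standby unit has not taken over yet")
        self.sessions[client] = request
        return f"standby:{request}"
```

Because the information is written to memory that both units can reach, the standby unit in this sketch needs no disk access to rebuild the state, and the state it rebuilds includes the most recent request handled by the main unit.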
The present invention is further characterized by a standby logical unit to take over control from the main logical unit, so that at a point in time where the standby logical unit receives a request from the main system to take over control, the standby logical unit searches information stored in the information storage space and, based on the information obtained from the search results, accesses the main memory resources controlled by the main logical unit, forms a processing environment and state identical to the main system by storing the main memory resources on a main memory device controlled by the standby logical unit, and afterwards takes over all or a portion of the processing of the main logical unit.
The main logical unit of the present invention is further characterized in that a plurality of memory areas on a main memory area controlled by the main logical unit contain the environment and processing status information of the main logical unit that must be copied onto a main memory area controlled by the standby logical unit; so that the standby logical unit can be requested to take over processing, the main memory addresses of these memory areas are stored or rewritten in the information storage space whenever a memory area is acquired or its location is changed; and
when taking over the processing from the main logical unit, the standby logical unit searches, in sequence, the information storage space for the main memory addresses of the plurality of memory areas, and obtains the information on the main memory area controlled by the main logical unit based on those main memory addresses.
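The address bookkeeping described above can be sketched in Python as follows. This is a simplified model under assumptions: a `bytearray` stands in for the main memory device, the information storage space is a dictionary mapping a hypothetical area name to an (address, length) pair, and the helper names `write_area` and `take_over_areas` are invented for the example.

```python
def write_area(memory, info_storage, name, address, data):
    """Main logical unit: place an area in memory and store (or rewrite)
    its main memory address in the information storage space."""
    memory[address:address + len(data)] = data
    info_storage[name] = (address, len(data))


def take_over_areas(memory, info_storage):
    """Standby logical unit: walk the stored addresses in sequence and
    copy each area into memory of its own (here, a new dict of bytes)."""
    copied = {}
    for name, (address, length) in info_storage.items():
        copied[name] = bytes(memory[address:address + length])
    return copied


memory = bytearray(64)   # stand-in for the main memory device
info_storage = {}        # stand-in for the information storage space

write_area(memory, info_storage, "environment", 0, b"env-v1")
write_area(memory, info_storage, "status", 16, b"step=3")
# The status area is relocated, so its stored address is rewritten.
write_area(memory, info_storage, "status", 32, b"step=4")

standby_copy = take_over_areas(memory, info_storage)
```

In this model only the addresses pass through the information storage space; the contents are read directly from the areas they point to at take-over time, so the standby unit sees the relocated and most recent status.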
The control method of the present invention is further characterized in that a control processor defines the logical structure of the shared main memory multiprocessor system such that one logical unit is a main logical unit and the other is a standby logical unit; provides a memory segment on the main memory device that can be accessed from both the main logical unit and the standby logical unit; provides an information storage space on the memory segment; maintains and controls these spaces; monitors the operating status of the main logical unit; and, when an error is detected in the monitoring results for the main logical unit, issues the request for take-over of control according to the state of the error and instructs the standby logical unit to take over control of all or a portion of the main logical unit processing.
The control method of the present invention is also characterized in that the control processor is one of the plurality of processors comprising the shared main memory multiprocessor, or is an external control terminal provided for the shared main memory multiprocessor.
The present invention is further characterized in that when a main memory area controlled by the main logical unit is protected from access by other logical units, the control processor receives a request from the standby logical unit to access the main memory area controlled by the main logical unit, performs the access, and transfers the information obtained by the access to the standby logical unit.
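The mediated access described above can be sketched in Python as follows. The protection model is an assumption made for illustration: `ProtectedMemory` permits access only to its owner and to a requester named "control", and `ControlProcessor.read_for_standby` is a hypothetical helper; a real system would rely on hardware memory protection rather than a name check.

```python
class ProtectedMemory:
    """Main memory area protected from access by other logical units
    (hypothetical model using requester names)."""

    def __init__(self, owner, size):
        self.owner = owner
        self._data = bytearray(size)

    def _check(self, requester):
        if requester not in (self.owner, "control"):
            raise PermissionError(
                f"{requester} may not access memory of {self.owner}")

    def write(self, requester, address, data):
        self._check(requester)
        self._data[address:address + len(data)] = data

    def read(self, requester, address, length):
        self._check(requester)
        return bytes(self._data[address:address + length])


class ControlProcessor:
    """Performs accesses on behalf of the standby logical unit."""

    def __init__(self, main_memory):
        self.main_memory = main_memory

    def read_for_standby(self, address, length):
        # The control processor performs the access itself and hands the
        # result back to the standby logical unit.
        return self.main_memory.read("control", address, length)
```

In this sketch a direct read by the standby unit raises `PermissionError`, while the same read requested through the control processor succeeds.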
The present invention is yet further characterized in that the operating status of the main logical unit is monitored, and when an error is detected in the results of monitoring the main logical unit, the standby logical unit takes over all of the main logical unit processing when the error is a failure of the main logical unit, and takes over a portion of the main logical unit processing when the error is a high load condition.
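The distinction drawn above between taking over all of the processing and taking over a portion of it can be sketched in Python as follows. The `Condition` enumeration, the `plan_takeover` function and the `share` parameter, which defaults to half of the tasks, are assumptions for illustration; the source does not state how the portion is chosen.

```python
from enum import Enum


class Condition(Enum):
    NORMAL = "normal"
    FAILURE = "failure"        # a problem has occurred in the main unit
    HIGH_LOAD = "high_load"    # processing has backed up in the main unit


def plan_takeover(condition, tasks, share=0.5):
    """Return the tasks the standby logical unit should take over.

    `share` is a hypothetical fraction used only for the high-load case.
    """
    if condition is Condition.FAILURE:
        return list(tasks)                 # failover: take over everything
    if condition is Condition.HIGH_LOAD:
        count = int(len(tasks) * share)    # cloning: take over a portion
        return list(tasks[:count])
    return []                              # normal operation: nothing
```

For example, with four tasks a failure hands all four to the standby unit, while a high load condition hands over the first two under the default share.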
The present invention is still further characterized in that when the standby logical unit takes over the main logical unit processing, the program code implemented by the standby logical unit is copied by the standby logical unit, from the main memory area controlled by the main logical unit, to the main memory area controlled by the standby logical unit, or the standby logical unit directly uses a program code already present in the main memory controlled by the main logical unit.