The present invention relates to a method and system for providing protection from a failure of equipment in critical systems. More specifically, the present invention relates to a sparing method and system for accommodating an equipment failure by providing and utilizing spare equipment.
It is known to provide redundant, or spare, equipment to replace or substitute for failed equipment in critical systems. For example, in critical equipment such as telephone switches or online transaction processing systems, it is well known to provide a redundant, or spare, power supply which is employed to keep the system running if a failure is detected in the primary power supply. In the event of such a failure, the spare power supply is substituted for the failed primary power supply, ideally in such a manner that operation of the system is not interrupted. Indeed, much study has been performed to identify points of failure in such critical systems and especially to identify single failure points, such as power supplies, where the single failure can result in an entire system failing.
In many circumstances, while spare equipment and failure detection means are provided, the switch-over from the failed equipment to the spare equipment is not without a cost. For example, while a battery can be provided to maintain a supply of power to equipment while a change from a primary power supply to a spare power supply is performed, in other circumstances, such as failure of a processor or communications link in an online transaction processing system, an interruption to the processing of the transaction data or even a loss of such data can occur.
Critical systems for which spare equipment is generally required can include telecommunication and/or data switching equipment, avionics systems, manufacturing and/or process control systems, etc. and such systems are often carefully designed to reduce or eliminate single failure points and to provide spare equipment.
Using switching equipment as a specific example, such equipment typically includes several network interface processors, each of which can include one or more microprocessors, memory, etc. and each of which operates to implement and maintain network connections and protocols. To date, sparing for the network interface processors in a switch has been provided by either including a spare network interface processor for each network interface processor in the switch, referred to as 1:1 sparing, or by providing a single spare which can be employed as a spare for any failed network interface processor of the n network interface processors in the switch, referred to as 1:n sparing.
Each of these sparing strategies has suffered disadvantages. Specifically, 1:1 sparing requires twice as much equipment as is in use at any one time and thus is very expensive to implement, raising the cost of ownership (manufacturing or lease, operation, including supply of power and cooling, etc.) of the critical system. While a critical system which employs 1:n sparing is less expensive to own and/or operate, as it only requires a single spare for the n pieces of equipment, it suffers a disadvantage in that substituting the spare equipment for a failed piece of equipment requires time to bring the spare equipment from an idle state to the state required of the failed equipment. This time period from idle to running as a substitute can often exceed a critical time, such as the drop time (the maximum time a connection will be maintained when the network interface processor controlling the connection is inoperative) in the case of a telecommunication switch. In this example, if the drop time is exceeded, the connections controlled by the failed network interface processor are dropped and must be re-established once the formerly spare network interface processor is running in place of the failed network interface processor. Reconnection of such dropped connections can require multiple minutes to achieve and thus connections can be lost for unacceptable time periods. Similarly, avionics and other critical systems have critical times within which service must be, or is desired to be, restored.
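The trade-off described above can be illustrated with a minimal sketch. The function names, the scheme labels and the numeric values used with them are hypothetical and chosen only for illustration; they are not taken from any particular switch.

```python
def equipment_count(n: int, scheme: str) -> int:
    """Total pieces of equipment needed to protect n working pieces."""
    if scheme == "1:1":
        return 2 * n   # one dedicated spare for every working piece
    if scheme == "1:n":
        return n + 1   # a single spare shared by all n working pieces
    raise ValueError(f"unknown sparing scheme: {scheme}")


def connections_survive(takeover_seconds: float, drop_time_seconds: float) -> bool:
    """Connections are maintained only if the spare is running
    before the drop time is exceeded."""
    return takeover_seconds <= drop_time_seconds
```

For example, with n equal to 8, 1:1 sparing calls for 16 pieces of equipment while 1:n sparing calls for 9, but the shared spare is only useful if its takeover time does not exceed the drop time.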
It is therefore desired to have a better method and system to cope with equipment failures in critical systems, such as telecommunication and/or data switching equipment, etc.
It is an object of the present invention to provide a novel sparing method and system to accommodate equipment failures in critical systems which obviates or mitigates at least one disadvantage of the prior art.
According to a first aspect of the present invention, there is provided a sparing method to accommodate equipment failures in a critical system comprising n pieces of equipment and at least one spare piece of equipment for said n pieces of equipment, comprising the steps of:
(i) loading software onto each of said n pieces of equipment;
(ii) commencing execution of said loaded software with said respective piece of equipment;
(iii) transferring an image of said loaded software on each of said n pieces of equipment to a memory on said at least one spare piece of equipment;
(iv) detecting a failure of one of said n pieces of equipment;
(v) causing said at least one spare piece of equipment to replace said detected one piece of equipment by commencing execution of said corresponding image of said loaded software and configuring said system and the non-failed ones of said n pieces of equipment to employ said at least one spare piece of equipment in place of said detected one piece of equipment.
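Steps (i) through (v) above can be sketched as a small in-memory simulation. This is an illustrative model only: the class names, the `routes` table used to represent reconfiguration of the system, and the `failed` flag standing in for failure detection are all assumptions made for the sketch and are not part of the described method.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Equipment:
    name: str
    software: bytes = b""
    running: bool = False
    failed: bool = False   # hypothetical stand-in for a real failure detector


@dataclass
class Spare:
    images: Dict[str, bytes] = field(default_factory=dict)  # name -> software image
    replacing: Optional[str] = None
    running_image: Optional[bytes] = None


class SparedSystem:
    def __init__(self, names):
        self.units = {name: Equipment(name) for name in names}
        self.spare = Spare()
        # Which piece of equipment currently serves each role (assumed model).
        self.routes = {name: name for name in names}

    def load_and_start(self, software_by_name):
        for name, software in software_by_name.items():
            unit = self.units[name]
            unit.software = software             # step (i): load software
            unit.running = True                  # step (ii): commence execution
            self.spare.images[name] = software   # step (iii): transfer image to spare

    def detect_failure(self):
        # Step (iv): report the first piece of equipment marked as failed.
        for name, unit in self.units.items():
            if unit.failed:
                return name
        return None

    def fail_over(self, failed_name):
        # Step (v): the spare executes the corresponding image and the
        # system is reconfigured to use the spare in place of the failed piece.
        self.spare.running_image = self.spare.images[failed_name]
        self.spare.replacing = failed_name
        self.units[failed_name].running = False
        self.routes[failed_name] = "spare"


system = SparedSystem(["nip0", "nip1", "nip2"])
system.load_and_start({"nip0": b"image-0", "nip1": b"image-1", "nip2": b"image-2"})
system.units["nip1"].failed = True
failed = system.detect_failure()
if failed is not None:
    system.fail_over(failed)
```

Because the image for each piece of equipment is already held by the spare before any failure occurs, the fail-over step in this sketch involves no transfer of software from the failed piece.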
According to another aspect of the present invention, there is provided a sparing system to accommodate equipment failures in a critical system comprising:
n pieces of equipment, each piece of equipment including a memory to maintain software to be executed by said piece of equipment and state information for said piece of equipment;
at least one spare piece of equipment including a journal memory to store an image of said software maintained in said memory of each of said n pieces of equipment;
a communication path between each of said n pieces of equipment and said at least one spare piece of equipment to allow transfer of an image of said software to be executed by each said piece of equipment to said journal memory, each of said n pieces of equipment operable to transfer a respective one of said images to said journal memory; wherein upon determination that one of said n pieces of equipment has experienced a failure, said at least one spare piece of equipment loads said image corresponding to said one piece of equipment and operates to execute said image to replace said one piece of equipment.
In one embodiment, the present invention provides a system and method of accommodating failure of equipment in critical systems through 1:n sparing, where an equipment spare is provided for every n pieces of equipment in the system. The spare equipment includes journal memory wherein an image of the operating system, applications and state of each piece of equipment being spared for can be maintained. The image information is updated at appropriate intervals when the equipment is operating without a failure and, in the event of a piece of equipment failing, the spare equipment loads the state of the failed equipment from the corresponding image and commences executing the software in the corresponding image, thus resuming performance of the activities of the failed equipment. In a second embodiment of the present invention, m:n sparing is provided and employed.
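The role of the journal memory in the first embodiment can be pictured with the following sketch. The `JournalMemory` class, its method names, the dictionary layout of an image and the use of deep copies are assumptions made for illustration; the actual organization of the journal memory is not specified here.

```python
import copy


class JournalMemory:
    """Holds, per piece of equipment, the most recently journaled image."""

    def __init__(self):
        self._images = {}

    def update(self, unit_id, software, state):
        # Called at appropriate intervals while the equipment is healthy.
        # A deep copy is stored so later changes on the equipment do not
        # alter the journaled image (an assumption of this sketch).
        self._images[unit_id] = {
            "software": software,
            "state": copy.deepcopy(state),
        }

    def restore(self, unit_id):
        # Called by the spare when the given piece of equipment fails.
        return copy.deepcopy(self._images[unit_id])


journal = JournalMemory()
live_state = {"connections": ["a", "b"]}

journal.update("nip1", software="os+apps v1", state=live_state)

# The equipment keeps running and its state changes after the last update.
live_state["connections"].append("c")

# On failure, the spare resumes from the most recently journaled image.
resumed = journal.restore("nip1")
```

In this sketch the spare resumes with the state captured at the last update, so the update interval bounds how much recent state can be missing after a failure.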