Software companies spend considerable funds on research and development (R&D) for the conception and implementation of software, be it control software, such as that of a microwave oven controller or a telephone system, or application software, such as spreadsheets or word processors. Furthermore, the maintenance of such systems has proven to be quite costly.
Of particular concern in server software systems is the occurrence of faults. Server software can fail because of a problem with the hardware on which the application is executing or because of a run-time error in the software itself. Such failures can result in reduced functionality or complete failure of the system as a whole, reducing its availability and reliability, which could in turn lead to catastrophic accidents or significant losses in revenue for the service provider. As a specific example, consider the telecommunication networks that form the backbone of modern communications, where millions of transactions are performed daily. Subscribers expect a certain level of reliability, so that service continues even when a component fails. As another example, the computer systems in an aircraft must continue to operate until the plane has landed, and the computers in air traffic control systems must be continuously available. Any failure in these systems could have serious repercussions.
Reliability in such systems is often achieved through a process commonly known as component redundancy. Redundancy provides a means by which, through the use of a "spare", the functionality of a vital part of a system can be maintained even when that part is faulty. A spare replaces the original component of the system and provides some or all of the services that the original component performed. Sometimes the spare is an exact replica of the original, or master, component and can completely replace the faulty master. Other times it is a less costly and less developed version of the master and can only replace the master temporarily, until the latter is repaired or replaced.
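The master and spare arrangement described above can be illustrated with a minimal sketch. It is not taken from any particular system: the class names `Component` and `RedundantPair`, the two exception types, and the service names used in the usage note are all hypothetical and chosen only for illustration. The sketch assumes that a fault is detected as an exception raised by the master, and that a reduced spare offers only a subset of the master's services.

```python
class ComponentFault(Exception):
    """Raised when a component is faulty and cannot serve requests."""


class ServiceUnavailable(Exception):
    """Raised when a component does not provide the requested service."""


class Component:
    """Hypothetical component offering a fixed set of named services."""

    def __init__(self, name, services, healthy=True):
        self.name = name
        self.services = set(services)
        self.healthy = healthy

    def handle(self, request):
        # A faulty component cannot serve anything.
        if not self.healthy:
            raise ComponentFault(f"{self.name} is faulty")
        # A reduced spare may lack some of the master's services.
        if request not in self.services:
            raise ServiceUnavailable(f"{self.name} does not provide {request}")
        return f"{self.name} handled {request}"


class RedundantPair:
    """Routes requests to the master and falls back to the spare on a fault."""

    def __init__(self, master, spare):
        self.master = master
        self.spare = spare

    def handle(self, request):
        try:
            return self.master.handle(request)
        except ComponentFault:
            # The master is faulty: the spare provides whatever services it can.
            return self.spare.handle(request)
```

For instance, with a master offering `"call_setup"` and `"billing"` and a spare offering only `"call_setup"`, marking the master as unhealthy causes `"call_setup"` requests to be served by the spare, while `"billing"` requests raise `ServiceUnavailable` until the master is repaired. This corresponds to the case in which the spare is a less developed version of the master.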
A great majority of the components involved in network communication are software based. In telecommunication networks, as in most control systems, the typical current process for providing a spare consists of purchasing duplicate hardware for all the vital parts of the system or, alternatively, reinstalling the software from scratch. When a fault occurs in one of the parts, the affected hardware is replaced by the spare while the original is fixed or replaced. If the problem lies in the software, the system must be taken offline and the software reinstalled. Both situations often lead to prolonged downtime for the system. Furthermore, even if the problem is confined to a small sub-system of a network node, the entire node has to be replaced by a spare, and some of the most recent system status information may be lost permanently. This is therefore a very costly and inflexible solution.
Thus, there exists a need in the industry for an improved process of redundancy and fault recovery, so as to obtain software systems with a high degree of reliability, particularly in applications with distributed software components.