Enterprise systems typically utilize multiple processors connected via a network to increase efficiency as well as reliability. Efficiency of execution is improved by utilizing multiple processors to perform a computation in parallel. Reliability is improved by ensuring that the service performed by the enterprise is not interrupted in spite of failures. A failure can occur in a processor or in the communication infrastructure. A large distributed system that employs a large number of processors and a complex communication infrastructure is likely to encounter more failures in a given period than a smaller system, since the larger system has more components that can fail. Enterprises often use distributed systems to provide services that can cause significant losses to the enterprise if interrupted. An enterprise that sells products or services online, for example, an online bookseller or an online reservation system, can lose a large amount of revenue if the online service is down for a long time.
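The relationship between system size and failure likelihood described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that components fail independently and with the same probability in a given period; real systems may not satisfy either assumption, and the function name and parameter values below are hypothetical.

```python
def prob_at_least_one_failure(num_components: int, failure_prob: float) -> float:
    """Probability that at least one component fails in a given period.

    Assumption (not from the source): failures are independent and every
    component has the same per-period failure probability.
    """
    if num_components < 0:
        raise ValueError("num_components must be non-negative")
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be between 0 and 1")
    # All components survive with probability (1 - p)^n, so the chance
    # of at least one failure is the complement of that value.
    return 1.0 - (1.0 - failure_prob) ** num_components


# Hypothetical sizes and per-component failure probability.
small_system = prob_at_least_one_failure(10, 0.01)
large_system = prob_at_least_one_failure(1000, 0.01)
print(f"small system: {small_system:.4f}")
print(f"large system: {large_system:.4f}")
```

Under these assumptions, a system of 10 components with a 1% per-component failure probability sees at least one failure in just under 10% of periods, whereas a system of 1000 such components is almost certain to see at least one.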
Furthermore, distributed systems for certain businesses need to be designed so that there is no loss of data when a failure occurs. For example, the system may continuously receive requests and updates from users, and the system is expected not to lose any of these requests or updates in spite of failures. Loss of data can create liability for an enterprise or require significant effort in either restoring the information or resolving issues with customers related to the lost data. For example, if a customer places an order and the information regarding the order is lost, the enterprise needs to resolve the customer issue, which may require live operators. Typically, the expense of resolving an issue using live operators is much higher than the cost of transactions executed automatically. In addition, loss of information may affect the reputation of the enterprise and result in a loss of customer goodwill.
Enterprises rely on hardware solutions, for example, fault tolerant switching hardware. These solutions require the enterprise to design its architecture around specialized hardware and make it difficult for the enterprise to switch to a different hardware vendor if needed. Several enterprises utilize solutions that require a technician to debug the problem and isolate the faulty component. Manual determination of faults can be a tedious, slow, and expensive process. Furthermore, solutions utilized by certain enterprises require the system to be restarted, which leaves the system unavailable until the restart operation completes. A restart is also likely to cause loss of requests from customers and therefore loss of information.