The present invention relates generally to fault tolerant distributed computing systems, and in particular, to a method for dynamically switching fault tolerance schemes in a distributed system based on wait times of user interface events.
Fault tolerance is a key technology in distributed systems for ensuring reliability of operations for user-critical applications such as e-commerce, database transactions and business-to-business (B2B) services. A distributed system is a group of computing devices interconnected with a communication network which function together to implement an application. Fault tolerance provides reliability of operation from the user's perspective by masking failures in critical system components. Known fault tolerant mechanisms for distributed systems can use different fault tolerance schemes, including different fault detection and recovery means, to handle various types of failures, such as device and network failures.
However, it is known that fault tolerance schemes may have different fault tolerance and performance trade-offs. In the context of interactive applications, fault tolerance schemes can have an adverse effect on the time that a user has to wait for a system response once the user interacts with the system, particularly in mobile computing environments. This delay can affect user perception of the performance of a system, which is significant because users are known to give up on applications if their requests are not met within certain time limits. Accordingly, it is desirable to limit detrimental trade-offs between fault tolerance and perceived system performance.
Furthermore, different applications may have different requirements for fault tolerance and performance. In addition, these requirements may change over the course of execution of the same application. It may be that no particular implementation of a fault tolerance mechanism will perform well for all applications. In this context, it is important to know when to switch fault tolerance schemes and which scheme to dynamically select.
Therefore, there is a need for a method of dynamically switching fault tolerance schemes that can improve the user perceived performance of a system while taking into account the desired level of fault tolerance.
In one aspect of the invention, a method of dynamically switching among a plurality of fault tolerance schemes is provided. The fault tolerance schemes are associated with a fault tolerance mechanism that executes in a distributed system. The method comprises obtaining a wait time of at least one user interface event occurring in the distributed system. The wait time includes at least one of a communications time, a service time and a fault tolerance time. The method further comprises determining whether a mean of the wait time is greater than a predetermined mean wait time threshold. The method also comprises determining whether the communications time, the service time and the fault tolerance time are mutually independent when the mean of the wait time is greater than the predetermined mean wait time threshold. In addition, the method comprises determining whether the mean of the wait time can be improved by reducing a mean of the fault tolerance time when the communications time, the service time and the fault tolerance time are mutually independent. The method also comprises switching from a first fault tolerance scheme to a second fault tolerance scheme when the wait time can be improved by reducing the mean of the fault tolerance time.
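By way of illustration only, the decision sequence of the foregoing method may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The names `WaitSample`, `looks_independent` and `choose_scheme`, the scheme labels, and the table of expected mean fault tolerance times per scheme are hypothetical. In particular, the summary does not specify how mutual independence is determined; the sketch assumes weak pairwise sample correlation as a rough stand-in, which is a weaker condition than mutual independence. It likewise assumes that the mean wait time "can be improved" when some other scheme has a lower expected mean fault tolerance time than the one currently observed.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence


@dataclass
class WaitSample:
    """Wait time of one user interface event, split into components (seconds)."""
    communications: float
    service: float
    fault_tolerance: float

    @property
    def total(self) -> float:
        return self.communications + self.service + self.fault_tolerance


def _pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def looks_independent(samples: List[WaitSample], max_abs_corr: float = 0.2) -> bool:
    """ASSUMPTION: weak pairwise correlation stands in for mutual independence."""
    c = [s.communications for s in samples]
    v = [s.service for s in samples]
    f = [s.fault_tolerance for s in samples]
    pairs = ((c, v), (c, f), (v, f))
    return all(abs(_pearson(a, b)) <= max_abs_corr for a, b in pairs)


def choose_scheme(
    samples: List[WaitSample],
    current: str,
    expected_ft_mean: Dict[str, float],  # hypothetical per-scheme estimates
    mean_wait_threshold: float,
) -> str:
    """Return the scheme to use next, following the order of the method steps."""
    if not samples:
        return current
    # Step 1: is the mean wait time above the predetermined threshold?
    if mean(s.total for s in samples) <= mean_wait_threshold:
        return current
    # Step 2: are the three components (approximately) independent?
    if not looks_independent(samples):
        return current
    # Step 3: can the mean wait be improved by reducing mean fault tolerance time?
    observed_ft_mean = mean(s.fault_tolerance for s in samples)
    best = min(expected_ft_mean, key=expected_ft_mean.get)
    if best != current and expected_ft_mean[best] < observed_ft_mean:
        return best  # Step 4: switch from the first scheme to the second
    return current
```

In this sketch, each check returns the current scheme as soon as a condition fails, so a switch occurs only when all three determinations succeed in the order recited above. The correlation threshold `max_abs_corr` is an arbitrary illustrative value and would need to be replaced by whatever independence test a given embodiment adopts.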
In another aspect of the invention, a fault tolerant distributed system capable of dynamically switching among a plurality of fault tolerance schemes associated with a fault tolerance mechanism is provided. The system comprises a means for obtaining a wait time of at least one user interface event occurring in the distributed system. The wait time includes at least one of a communications time, a service time and a fault tolerance time. The system further comprises a means for determining whether a mean of the wait time is greater than a predetermined mean wait time threshold. The system also comprises a means for determining whether the communications time, the service time and the fault tolerance time are mutually independent when the mean of the wait time is greater than the predetermined mean wait time threshold. In addition, the system comprises a means for determining whether the mean of the wait time can be improved by reducing a mean of the fault tolerance time when the communications time, the service time and the fault tolerance time are mutually independent. The system also comprises a means for switching from a first fault tolerance scheme to a second fault tolerance scheme when the wait time can be improved by reducing the mean of the fault tolerance time.