The present invention relates generally to distributed computing environments, and more specifically to a fault tolerant distributed computing framework in a mission critical environment.
Today, it is quite common to have complex computer systems with multiple computers connected through one or more networks. Typically, applications are distributed among the multiple computers and communicate using one of several industry standard distributed computing frameworks. In general, a distributed computing framework provides a specification for how objects interact and communicate with each other. The communication may occur within one process, between two different processes on one computer, or across the network between processes running on different computers. These frameworks allow the inter-process and network communication layers to be completely transparent to the application developer. Therefore, application developers may easily scale applications across multiple machines with various architectures and various operating systems. The distributed computing frameworks also facilitate inter-operability between software components created by different vendors by clearly defining interfaces for the software components.
Currently, the Distributed Component Object Model (DCOM) defined by the Microsoft Corporation, of Redmond, Wash., is one of the most popular distributed computing frameworks for enterprise applications. Typically, applications using DCOM reside on personal computers (PCs). In some enterprises, however, it may be desirable to extend the distributed applications to a variety of embedded systems, such as heating, ventilating, and air conditioning (HVAC) controllers, data loggers, and programmable logic controllers (PLCs).
In some situations, it may be desirable for some DCOM applications residing on personal computers to operate in a mission critical environment, such as industrial automation or building automation. However, there are problems with using existing distributed computing frameworks for embedded systems and mission critical systems. For instance, both embedded systems and mission critical systems must typically meet higher reliability standards than typical PC applications. These higher reliability standards require the system to recover from errors or faults without affecting the operation of the system as a whole and without the intervention of a human technician.
Prior attempts at achieving high reliability for embedded systems and mission critical systems have focused on creating proprietary software for each different type of system. While these proprietary solutions offer some fault tolerant characteristics, they have the disadvantage that the software must be modified for each different system.
Therefore, given the shortcomings associated with the prior art proprietary software solutions, there is a present need for a fault tolerant distributed computing framework that provides high reliability without requiring the software for each different system to be modified.
In accordance with the present invention, a system and method are provided for a fault tolerant distributed computing framework that allows the system to detect failures and to recover gracefully from the failures. In addition, the present invention allows the system to inter-operate with existing applications and objects that operate in an existing distributed computing framework, such as DCOM.
The fault tolerant system of the present invention provides inter-operability to applications and objects that operate in an existing distributed computing framework. The fault tolerant system includes a first layer and a second layer. The first layer includes an application proxy operable to communicate with the applications as if the applications were communicating through the existing distributed computing framework, and an object stub operable to communicate with the objects as if the objects were communicating through the existing distributed computing framework. The second layer includes a fault detection mechanism that communicates through the first layer to determine whether any one of a plurality of objects has experienced a failure. The fault tolerant system further includes a fault recovery mechanism for recovering from the failure detected by the fault detection mechanism.
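By way of illustration only, the two-layer arrangement summarized above may be sketched in Python as follows. The sketch is a simplified, single-process model and is not the disclosed implementation: all class and method names (ObjectStub, ApplicationProxy, FaultDetector, FaultRecovery, invoke, call, failed_objects, recover) are hypothetical, the "ping" probe and the healthy flag are assumptions standing in for whatever detection signal an actual system would use, and recovery by re-creating the object from a factory is only one possible recovery strategy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: an "object" is modeled as a function from request to reply.
Handler = Callable[[str], str]


class ObjectFailure(Exception):
    """Raised by a stub when the object behind it does not respond."""


@dataclass
class ObjectStub:
    """First layer, object side: forwards requests to the wrapped object."""
    name: str
    target: Handler
    healthy: bool = True  # assumption: stands in for a real liveness signal

    def invoke(self, request: str) -> str:
        if not self.healthy:
            raise ObjectFailure(self.name)
        return self.target(request)


@dataclass
class ApplicationProxy:
    """First layer, application side: the application calls the proxy
    as it would call the object through the existing framework."""
    stub: ObjectStub

    def call(self, request: str) -> str:
        return self.stub.invoke(request)


class FaultDetector:
    """Second layer: probes every object through the first layer."""

    def __init__(self, stubs: List[ObjectStub]) -> None:
        self.stubs = stubs

    def failed_objects(self) -> List[ObjectStub]:
        failed = []
        for stub in self.stubs:
            try:
                stub.invoke("ping")  # assumption: a probe request
            except ObjectFailure:
                failed.append(stub)
        return failed


class FaultRecovery:
    """Recovers a failed object by re-creating it from a factory
    (one possible strategy, chosen here for brevity)."""

    def __init__(self, factories: Dict[str, Callable[[], Handler]]) -> None:
        self.factories = factories

    def recover(self, stub: ObjectStub) -> None:
        stub.target = self.factories[stub.name]()
        stub.healthy = True
```

In this sketch the application holds only the ApplicationProxy, so a failure and the subsequent recovery of the underlying object take place without any change to the application code, which is the property the summary attributes to the first layer.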