1. Field of the Invention
The present invention relates generally to computerized diagnostic and failure recovery techniques, and more specifically, to such techniques wherein a first computer process is monitored by a second computer process for occurrence of a failure condition, and the second process takes appropriate action when and if the failure condition occurs. As will be appreciated by those skilled in the art, although the present invention will be described in connection with specific embodiments and methods of use wherein the first process is a component object model (COM) server process, the present invention finds utility in diagnostics and failure recovery of processes other than COM processes. Thus, the present invention should not be viewed as being limited to use in diagnostics and failure recovery of COM processes, but rather should be viewed broadly as being limited only as set forth in the hereinafter appended claims.
2. Brief Description of Related Prior Art
The component object model (COM) is a software component architecture standard promulgated by Microsoft Corporation of Redmond, Wash. that allows applications and systems to be constructed using software components supplied by different software suppliers. The COM architecture permits higher level software components to exchange data with each other according to a well-defined protocol.
The specifics of the COM architecture are well known to those skilled in the art, and provide a programming language-independent and computer platform-independent standard for software component interoperability. More specifically, the COM architecture provides respective standards for software components executed on respective computer platforms that define how the components initialize and use virtual function tables to call functions supported by the software components via the function pointers in the tables. This standardizes the way in which components interoperate with other components when they call such functions.
In the COM architecture, an executing process that provides or uses a service may be called an “object.” Objects interact with each other via “interfaces.” In essence, a COM object interface provides and defines one or more related operations or functions (“methods”) provided by the object, and behaviors and responsibilities associated with these methods. An object accesses an interface of another object by utilizing a function pointer to that interface. A “server” object makes available one or more of its methods to a “client” object. That is, a client object accesses an interface of a server object to utilize one or more methods provided by that interface. The client object and the server object may each be separate computer processes, or alternatively, may be comprised in the same process.
According to the COM architecture, each and every interface has a respective, unique interface identifier associated with it. These interface identifiers are referred to as globally unique identifiers (GUID). When a client object wishes to discover whether or not a particular interface is supported by a server object, the client object calls a special method that is supported by all components, called QueryInterface. Calling this special method yields one of two results: if the server object being queried supports that interface, the server object returns the appropriate interface pointer and a success code to the client object; conversely, if the server object being queried does not support that interface, the server object returns an error value to the client object.
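By way of illustration only, the two outcomes of such an interface query can be modeled in a few lines of Python. This is a toy sketch and not the COM programming interface itself: the class `ToyServerObject`, the method name `query_interface`, and both interface identifiers are hypothetical names introduced for this example. Only the two status values follow the HRESULT codes that COM uses for these outcomes.

```python
import uuid

# Status values matching the HRESULT codes COM uses for these two outcomes.
S_OK = 0x00000000
E_NOINTERFACE = 0x80004002

# Hypothetical interface identifiers, made up for this sketch.
IID_ILOGGER = uuid.UUID("11111111-1111-1111-1111-111111111111")
IID_ISTORAGE = uuid.UUID("22222222-2222-2222-2222-222222222222")


class Logger:
    """Hypothetical implementation of the logger interface."""

    def log(self, message):
        return f"logged: {message}"


class ToyServerObject:
    """Toy stand-in for a server object; not a real COM object."""

    def __init__(self, interfaces):
        # Maps interface identifier -> object implementing that interface.
        self._interfaces = dict(interfaces)

    def query_interface(self, iid):
        """Return (status, pointer); pointer is None when unsupported."""
        impl = self._interfaces.get(iid)
        if impl is None:
            return E_NOINTERFACE, None
        return S_OK, impl


server = ToyServerObject({IID_ILOGGER: Logger()})
status, logger = server.query_interface(IID_ILOGGER)
missing_status, missing = server.query_interface(IID_ISTORAGE)
```

In this sketch the supported identifier yields a success code together with a usable reference, while the unsupported identifier yields only an error value, mirroring the behavior described above.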
Further, in accordance with the COM architecture, when client and server objects are in different processes, proxy and stub intermediate objects are created which exchange data between the processes. More specifically, when the QueryInterface method returns a success code to the client object, and the client and server objects are in different processes, a proxy object is created in the client object and an associated stub object is created in the server object.
There are a number of ways in which a server object may fail (i.e., experience a failure condition in its operation or execution, such as becoming unresponsive to interface access requests from client objects). Examples of events that can cause a server object to fail include an untrapped exception in the server object or a deadlock in its execution. When such a failure condition occurs in operation of the server object, if the client object and the server object are comprised in different respective processes, a call made by the client object to a server object interface will be ineffective to call methods of that interface, and will instead result in return of an error message from the operating system to the client object. Typically, the client object may be programmed to take corrective action to return the server object to a normal operating mode (e.g., by issuing appropriate requests to the operating system that the operating system terminate and restart the server object) if the client object receives a predetermined number of such error messages in a predetermined time period.
Typically, after the server object has failed, a time period of several seconds may occur between the issuing of an interface call and the return of an error message to the client object from the operating system. Thus, since the client object typically will not take corrective action to return the server object to a normal operating mode unless the client object has received in the predetermined time period multiple error messages from the operating system, there may be a significant time lapse between failure of the server object and the taking of such corrective action by the client object. Disadvantageously, the significant time lapse that may exist between failure of a server object and the taking of corrective action by the client object introduces significant inefficiencies into the interactions between the client and server objects, and may reduce the processing efficiency of the computer system. If multiple client objects are involved, these inefficiencies may be further exacerbated.
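The time lapse described above can be made concrete with a small Python simulation of the prior-art client behavior. It is offered only as a sketch under stated assumptions: the names `ErrorWindow` and `time_until_corrective_action` are hypothetical, the server is assumed to fail at time zero, the client is assumed to issue its calls back to back, and the threshold of three errors, the 30-second window, and the 5-second error delay are illustrative values not taken from any particular system.

```python
from collections import deque


class ErrorWindow:
    """Counts error returns seen within a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, now):
        """Record an error at time `now`; return True if corrective action is due."""
        self._timestamps.append(now)
        # Discard errors that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


def time_until_corrective_action(threshold, window_seconds, error_delay, max_calls=1000):
    """Simulate back-to-back failed calls after a server failure at time 0.

    Each failed call takes `error_delay` seconds to return its error.
    Returns the elapsed time at which the client would act, or None if
    the threshold is never reached within `max_calls` calls.
    """
    window = ErrorWindow(threshold, window_seconds)
    now = 0.0
    for _ in range(max_calls):
        now += error_delay
        if window.record_error(now):
            return now
    return None


# Illustrative values only: 3 errors within 30 s, each error taking 5 s to return.
elapsed = time_until_corrective_action(threshold=3, window_seconds=30.0, error_delay=5.0)
```

Under these assumed values the client does not begin corrective action until 15 seconds after the failure, since it must wait for three separate error returns.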
In accordance with the present invention, computerized diagnostic and failure recovery techniques are provided that are able to overcome the aforesaid and other disadvantages and drawbacks of the prior art. More specifically, in one aspect of the present invention, a diagnostic and failure recovery technique is provided in which a first computer process (e.g., a COM server object process) requests that a second computer process monitor the first process for occurrence of a failure condition in operation of the first process. If the second process determines that the failure condition has occurred, the second process initiates corrective action to return the first process to a normal operating mode.
In a second aspect of the present invention a technique is provided that may be practiced separately or in combination with the technique of the first aspect of the present invention. In the technique of the second aspect of the present invention, the first computer process requests that the second computer process monitor the first process for occurrence of a failure condition in operation of the first process. The second process provides to a third computer process (e.g., a COM client object process) an indication as to whether the failure condition has occurred. The second process may provide this indication to the third process in response to a request for such indication from a special proxy object in the third process. The proxy object may also detect and correct a pointer to the first process made invalid due to a failure and subsequent recovery of the first process.
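One way such a proxy object might behave is sketched below in Python, purely for illustration. All names (`ToyMonitor`, `ToyServer`, `RecoveringProxy`, `locate_server`) are hypothetical, and the sketch assumes that the second process exposes its failure indication as a restart counter; the description above requires only that some indication be provided, so the counter is an assumption of this example.

```python
class ToyMonitor:
    """Stand-in for the second (monitoring) process."""

    def __init__(self):
        # Assumed form of the failure indication: incremented on each
        # failure and recovery of the first process.
        self.restart_count = 0


class ToyServer:
    """Stand-in for one instance of the first (server) process."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def greet(self):
        if not self.alive:
            raise RuntimeError("stale pointer: server instance is gone")
        return f"hello from {self.name}"


class RecoveringProxy:
    """Client-side proxy that refreshes its server reference after a recovery."""

    def __init__(self, monitor, locate_server):
        self._monitor = monitor
        self._locate_server = locate_server  # hypothetical lookup callback
        self._server = locate_server()
        self._seen_restarts = monitor.restart_count

    def greet(self):
        # Request the failure indication before using the stored pointer.
        current = self._monitor.restart_count
        if current != self._seen_restarts:
            # The old pointer is invalid; obtain one to the restarted server.
            self._server = self._locate_server()
            self._seen_restarts = current
        return self._server.greet()


registry = {"current": ToyServer("instance-1")}
monitor = ToyMonitor()
proxy = RecoveringProxy(monitor, lambda: registry["current"])
first = proxy.greet()

# Simulate a failure followed by termination and restart of the server.
registry["current"].alive = False
registry["current"] = ToyServer("instance-2")
monitor.restart_count += 1
second = proxy.greet()
```

In this sketch the client code calls `proxy.greet()` both times without change; the proxy alone detects the recovery and replaces the invalid reference.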
The corrective action that may be taken by the second process in the technique of the first aspect of the present invention may comprise terminating and restarting the first process (e.g., via issuance of appropriate requests to an operating system process). Additionally, the second process may provide the first process with data that enables the first process to change its current internal state after having been terminated and restarted to a normal internal state that existed prior to the occurrence of the failure condition. The provision of this data to the first process by the second process may be initiated by the transmission to the second process from the first process of an identification code that uniquely identifies the data.
The unique identification code may have been first provided by the second process to the first process after termination and restart of the first process. The data transmitted to the first process that permits the first process to return to the normal internal state that existed prior to occurrence of the failure condition may have been provided to the second process by the first process prior to occurrence of the failure condition. When the first process receives this data, the first process may verify the validity of the data prior to using it to change its internal state.
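The exchange of state data and identification code described above may be illustrated by the following Python sketch. It is one possible arrangement and not necessarily the one used in any embodiment: the names `StateStore`, `save_state`, `id_after_restart`, `fetch_state`, and `restore_state` are hypothetical, the identification code is assumed to be a random UUID, and the validity check is assumed to be a SHA-256 digest comparison, since the description does not specify how validity is verified.

```python
import hashlib
import json
import uuid


class StateStore:
    """Stand-in for the second process's storage of the first process's state."""

    def __init__(self):
        self._saved = {}        # identification code -> (payload bytes, digest)
        self._latest_id = None

    def save_state(self, state):
        """Called by the first process before any failure occurs."""
        payload = json.dumps(state, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        state_id = uuid.uuid4().hex  # assumed form of the identification code
        self._saved[state_id] = (payload, digest)
        self._latest_id = state_id

    def id_after_restart(self):
        """Identification code handed to the first process after its restart."""
        return self._latest_id

    def fetch_state(self, state_id):
        """Return the data identified by the code sent by the first process."""
        return self._saved[state_id]


def restore_state(store):
    """Run by the restarted first process to recover its earlier state."""
    state_id = store.id_after_restart()
    if state_id is None:
        return None  # nothing was saved before the failure
    payload, digest = store.fetch_state(state_id)
    # Verify validity before using the data (assumed check: digest match).
    if hashlib.sha256(payload).hexdigest() != digest:
        raise ValueError("saved state failed validation")
    return json.loads(payload)


store = StateStore()
store.save_state({"counter": 41, "mode": "ready"})
recovered = restore_state(store)
```

If the stored bytes no longer match their digest, `restore_state` raises an error instead of adopting the data, which corresponds to the first process declining to use data it cannot validate.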
In the technique of the second aspect of the present invention, if the second process determines that the failure condition has occurred, the second process may initiate corrective action to return the first process to a normal operating mode, which corrective action may comprise terminating and restarting the first process. The second process may determine whether the failure condition has occurred based at least upon whether the second process has received a message from the first process within a predetermined time period. The predetermined time period may be specified in a message from the first process to the second process that requested that the second process monitor the first process for occurrence of the failure condition. After issuing this request message, the first process may provide to the second process, so long as the failure condition has not occurred, periodic messages (e.g., method calls) indicating that the failure condition has not occurred.
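The timeout-based determination described above may be sketched in Python as follows. The sketch is illustrative only: the names `Watchdog`, `request_monitoring`, `heartbeat`, `failure_detected`, and `FakeClock` are hypothetical, and a manually advanced clock is used so that the behavior can be followed step by step. A real monitor could instead be given a clock such as `time.monotonic`.

```python
class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class Watchdog:
    """Stand-in for the second process's failure determination."""

    def __init__(self, clock):
        self._clock = clock
        self._timeout = None
        self._last_seen = None

    def request_monitoring(self, timeout_seconds):
        """Request message from the first process; it carries the time period."""
        self._timeout = timeout_seconds
        self._last_seen = self._clock()

    def heartbeat(self):
        """Periodic message indicating that no failure has occurred."""
        self._last_seen = self._clock()

    def failure_detected(self):
        """True if no message arrived within the predetermined time period."""
        if self._timeout is None:
            return False  # monitoring was never requested
        return self._clock() - self._last_seen > self._timeout


clock = FakeClock()
watchdog = Watchdog(clock)
watchdog.request_monitoring(timeout_seconds=2.0)

clock.now = 1.5
watchdog.heartbeat()                           # first process is still healthy

clock.now = 3.0
healthy_check = watchdog.failure_detected()    # 1.5 s since the last message

clock.now = 4.0
failed_check = watchdog.failure_detected()     # 2.5 s since the last message
```

With the assumed 2-second period, the check at 3.0 seconds reports no failure, while the check at 4.0 seconds reports a failure because more than 2 seconds have elapsed since the last message. At that point the second process would initiate corrective action.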
Thus, in accordance with the present invention, a first process (which may be a COM server object process) may be monitored by a second process for occurrence of a failure condition in operation of the first process. If the second process determines that such a failure condition has occurred, the second process may take appropriate corrective action to return the first process to a normal operating mode and internal state that existed prior to the occurrence of the failure condition. Alternatively, or in addition thereto, the second process may signal or notify a third process (e.g., a client object process) that the failure condition has occurred. This permits the third process to correct invalid pointers to the first process.
Thus, advantageously, the techniques of the present invention permit a COM server process failure to be determined and corrected, and notification of same to be provided to a COM client process, more quickly and efficiently than is possible in the prior art. Also, by utilizing the techniques of the present invention, interaction between such server and client processes and computer system processing may be substantially more efficient than in the prior art.
These and other features and advantages of the present invention will become apparent as the following Detailed Description proceeds and upon reference to the drawings, wherein like numerals depict like parts, and in which: