Systems that run application software typically employ a Fault Management System (hereinafter “FMS”) to monitor and manage any faults that occur in the system. As such, an FMS must be able to identify situations that constitute a fault condition and then determine what information needs to be collected about that particular fault situation. A typical FMS includes at least one software application to be monitored, at least one monitoring agent and at least one fault event management system. When the software application being monitored encounters a fault situation, the software application generates a fault event message which is captured by the monitoring agent and reported to the fault event management system.
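The reporting path described above (application, monitoring agent, fault event management system) can be pictured with a minimal sketch. All class, method, and field names here are hypothetical and chosen only for illustration; the text does not prescribe any particular interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FaultEvent:
    """A fault event message (hypothetical shape: where it happened, and what)."""
    source: str
    description: str


@dataclass
class FaultEventManagementSystem:
    """Receives reported fault events; this sketch simply stores them."""
    received: List[FaultEvent] = field(default_factory=list)

    def report(self, event: FaultEvent) -> None:
        self.received.append(event)


@dataclass
class MonitoringAgent:
    """Captures fault events from the application and forwards them."""
    management_system: FaultEventManagementSystem

    def capture(self, event: FaultEvent) -> None:
        self.management_system.report(event)


class MonitoredApplication:
    """The software application being monitored."""

    def __init__(self, agent: MonitoringAgent) -> None:
        self.agent = agent

    def divide(self, a: float, b: float) -> Optional[float]:
        try:
            return a / b
        except ZeroDivisionError as exc:
            # The application itself decides that this is a fault and
            # generates the fault event message for the agent.
            self.agent.capture(FaultEvent("MonitoredApplication.divide", str(exc)))
            return None
```

Note that in this arrangement the management system only ever learns what the application chooses to emit.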
Currently, a number of different approaches are used to determine what situations constitute a fault situation and, in the event of a fault situation, what information regarding the fault situation should be collected and reported. For example, one approach is to leave the fault situation determination up to the developer, who programs in the specific event situations that constitute a fault event. Unfortunately, this approach has several disadvantages. One disadvantage is that the FMS relies completely on the software application being monitored to provide all of the required information regarding the fault event; if the application fails to report a fault event to the FMS, the FMS has no information regarding that fault event. Another disadvantage is that the fault event information to be reported to the FMS is defined during the development of the software application and is thus ‘hard’ programmed into the software. In this situation, the fault event information to be reported to the FMS is not easily modified and cannot be adjusted ‘on-the-fly’. Thus, if the reported fault event information is not sufficient for issue resolution, the software application must be modified in order to collect more information and troubleshoot the cause of the fault event.
Still another disadvantage is that, at the section of the source code where a fault event occurs, the developer may not have sufficient information to decide whether the fault event is a critical event, whether the fault event needs to be reported to the FMS, or whether the fault event should be handled by the software application itself. For example, consider the situation where a software application comprises two components, a Business Component (BC) and a File Access Component (FAC), where the BC uses the FAC for all file-related operations and where an exception (i.e., a fault event) occurs when the FAC attempts to access a specific file. At the moment the exception occurs, the FAC has all of the information about the exception environment, such as the file name, file path, current settings, etc. However, at this point, the FAC does not have enough information to determine whether the BC will correctly handle the exception. Moreover, although the BC has enough information to determine whether the exception is a critical exception, the BC has no information about the exception environment beyond the information reported by the FAC. Thus, at the moment an exception occurs, the software application has all of the information about the exception environment but no information regarding the criticality of the exception, and by the time the software application does have information regarding the criticality of the exception, the exception environment no longer exists.
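The BC and FAC situation can be illustrated with a short sketch. The class and method names below are hypothetical, and the use of `from None` is a deliberate modeling choice: it drops the chained low-level error so that only the FAC's reported message crosses the component boundary, as in the scenario described above.

```python
import os


class FileAccessError(Exception):
    """Raised by the FAC; carries only what the FAC chose to report."""


class FileAccessComponent:
    def read(self, path: str) -> str:
        try:
            with open(path, encoding="utf-8") as handle:
                return handle.read()
        except OSError:
            # Here the full exception environment is available (full path,
            # working directory, OS error details), but the FAC cannot know
            # whether its caller considers this failure critical.
            # Assumption for illustration: only the bare file name is reported.
            raise FileAccessError(f"cannot read {os.path.basename(path)}") from None


class BusinessComponent:
    def __init__(self, fac: FileAccessComponent) -> None:
        self.fac = fac

    def load_optional_settings(self, path: str) -> str:
        try:
            return self.fac.read(path)
        except FileAccessError:
            # Non-critical for the BC: fall back to empty settings.
            return ""

    def load_required_data(self, path: str) -> str:
        # Critical for the BC: the error propagates, but by now only the
        # FAC's short message is left; the original environment is gone.
        return self.fac.read(path)
```

The same failure inside the FAC is harmless in `load_optional_settings` and fatal in `load_required_data`, yet the FAC has to decide what to record before either caller has had a chance to classify it.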
One way to solve this problem would be to collect all known information about the exception environment and provide this information to upper level components, allowing these components to make decisions about the criticality of the exception. Unfortunately, however, because the application would spend a significant amount of time collecting information regarding the exception environment, this approach may seriously affect the performance and scalability of the application. Another way to solve this problem would be to collect minimal information regarding the exception environment and to provide this information to an upper level component. Unfortunately, however, while this approach does not appear to impact application performance, it may lead to situations where there is not enough information to troubleshoot an issue, thus leading to the same problems described above.
One alternative to relying on the software developer for application fault reporting involves performing an Automated Static Instrumentation of the code. For example, once an application has been developed, the application is processed using an instrumentation tool which adds exception management code either to the source code or at the binary level. The resultant, or processed, code is then used. Unfortunately, although this approach resolves the issue of the FMS relying on the software application being monitored to detect and report fault situations, the problems of a static amount of reported information and of the inability to distinguish between critical exceptions and non-critical exceptions at the moment the exception occurs still exist.
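As a rough analogy only, the following sketch wraps every public method of an already-written class with exception reporting code. Real instrumentation tools operate on source or binary code; wrapping methods at load time is a stand-in chosen for brevity, and the `instrument` helper, the `Parser` class, and the report fields are all hypothetical.

```python
import functools
import inspect
from typing import Callable, Dict, List

Report = Dict[str, str]


def _wrap(func: Callable, label: str, sink: List[Report]) -> Callable:
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # The reported fields are fixed when the code is instrumented
            # and cannot be adjusted while the application is running.
            sink.append({"where": label, "type": type(exc).__name__})
            raise
    return wrapper


def instrument(cls: type, sink: List[Report]) -> type:
    """Add reporting code to every public method of an existing class."""
    for name, attr in list(vars(cls).items()):
        if inspect.isfunction(attr) and not name.startswith("_"):
            setattr(cls, name, _wrap(attr, f"{cls.__name__}.{name}", sink))
    return cls


class Parser:
    """Application code written with no fault reporting of its own."""

    def parse_int(self, text: str) -> int:
        return int(text)


reports: List[Report] = []
instrument(Parser, reports)
```

After instrumentation, a failing `Parser().parse_int("x")` is reported without any cooperation from the application, but it is reported with the same fixed fields whether or not the caller goes on to handle the `ValueError` harmlessly.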