As computing solutions become more distributed, it becomes increasingly challenging to monitor whether every event that occurs is properly handled. Each event typically begins with an initial stimulus, which triggers further processing and operations for handling the event, and continues until a terminal condition occurs. Examples of initial stimuli include fault conditions and requests for service. Fault conditions may include disk drive failures, over-temperature conditions, etc. Requests for service may include input/output (I/O) requests, backup requests, recovery requests, replication requests, etc. When the further processing and operations used to handle the event are delegated to separate modules, it may not be easy to determine whether the terminal condition has occurred, which would indicate that the event has been completely handled. Additionally, it may be even more challenging to determine which of the many modules was unable to complete the further processing as requested.
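By way of illustration only, the event lifecycle described above (initial stimulus, delegation to modules, terminal condition) may be modeled as in the following minimal Python sketch. The names EventRecord, EventTracker, delegate, and stalled, as well as the module names and the timeout value, are hypothetical and are not drawn from any particular system. The sketch assumes that each hand-off to a module is recorded centrally, which is precisely what is often difficult to arrange in a distributed system.

```python
from dataclasses import dataclass, field


@dataclass
class EventRecord:
    """Hypothetical per-event record; all field names are illustrative."""
    stimulus: str                 # e.g. "disk drive failure" or "backup request"
    started_at: float             # time of the initial stimulus
    hops: list = field(default_factory=list)  # modules delegated to, in order
    done: bool = False            # True once the terminal condition is observed


class EventTracker:
    """Assumed central registry of events and their hand-offs."""

    def __init__(self):
        self.events = {}

    def start(self, event_id, stimulus, now):
        self.events[event_id] = EventRecord(stimulus, now)

    def delegate(self, event_id, module):
        # Record that further processing was handed to another module.
        self.events[event_id].hops.append(module)

    def finish(self, event_id):
        # Record that the terminal condition occurred.
        self.events[event_id].done = True

    def stalled(self, now, timeout):
        """Return (event_id, last_module) for events older than the timeout
        that have not reached their terminal condition."""
        result = []
        for event_id, rec in self.events.items():
            if not rec.done and now - rec.started_at > timeout:
                last = rec.hops[-1] if rec.hops else None
                result.append((event_id, last))
        return result


tracker = EventTracker()

# A request for service that reaches its terminal condition.
tracker.start("e1", "backup request", now=0.0)
tracker.delegate("e1", "scheduler")
tracker.delegate("e1", "storage")
tracker.finish("e1")

# A fault condition that is delegated but never completes.
tracker.start("e2", "disk drive failure", now=1.0)
tracker.delegate("e2", "fault_handler")
tracker.delegate("e2", "raid_rebuild")

stalled = tracker.stalled(now=60.0, timeout=30.0)
print(stalled)
```

In this sketch, the second event is reported together with the last module known to have received it. Without such a record of hand-offs, only the absence of the terminal condition would be observable, and the responsible module would remain unknown.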
Failure to reach the terminal condition may come from any of several sources. In some instances, the design of one or more of the modules may be faulty, resulting in an incomplete or defective state machine. Failures may also occur when messages sent between the modules are lost and/or corrupted. Deadlock conditions may occur where two or more modules are each waiting on another to finish some processing before proceeding. Additionally, one or more of the modules may include defects that result in a failure.
Accordingly, it would be desirable to provide improved methods and systems for managing the handling and tracking of events in a computing system.
In the figures, elements having the same designations have the same or similar functions.