Modern computer controlled devices rely heavily on the proper functioning of software processing to control their general operation. Typically in such devices, an operating system is made up of one or more software programs that execute on a central processing unit (CPU) in the device and schedule the operation of processing tasks. During execution, the operating system provides routine processing functions which may include device resource scheduling, process control, memory and input/output management, system services, and error or fault recovery. Generally, the operating system organizes and controls the resources of the device to allow other programs or processes to manipulate those resources to provide the functionality associated with the device.
Modern central processing units (i.e., microprocessors) can execute sequences of program instructions very quickly. Operating systems which execute on such processors take advantage of this speed by scheduling multiple programs to execute "together" as individual processes. In these systems, the operating system divides the total available processor cycle time among the executing processes in a timesliced manner. By allowing each process to execute some instructions during its designated timeslice, and by rapidly switching between timeslices, the processes appear to be executing simultaneously. Operating systems providing this capability are called multi-tasking operating systems.
Fault management and control in devices that execute multiple processes is an important part of device operation. As an example of fault control, suppose that a first process depends upon a second process for correct operation. If the second process experiences an error condition such as failing outright, hanging or crashing, the first, dependent process may be detrimentally affected (e.g., may operate improperly).
Fault control in some form or another is usually provided by the operating system, since the operating system is typically responsible for dispatching and scheduling most processes within a computer controlled device. Many prior art operating systems and device control programs include some sort of process-monitoring process such as a dispatcher process, a monitor daemon, a watchdog process, or the like. One responsibility of such monitoring processes is to restart failed, hung or crashed processes.
As an example, prior art U.S. Pat. No. 4,635,258, issued Jan. 6, 1987, discloses a system for detecting a program execution fault. This patent is hereby incorporated by reference in its entirety. The fault detection system disclosed in this patent includes a monitoring device for monitoring the execution of program portions of a programmed processor for periodic trigger signals. Lack of trigger signal detection indicates a fault condition and the monitoring device generates a fault signal in response to a detected faulty program execution condition. Logic circuitry is included for restarting a process that the monitoring device indicates has faulted. The monitoring device may require a predetermined number of trigger signals before indicating an alarm condition. The system also includes circuitry for limiting the number of automatic restarts to a predetermined number which avoids continuous cycling between fault signal generation and reset.
Prior art fault management systems can experience problems in that these systems restart a failed process with little or no regard to why the process faulted in the first place. Prior art systems using this approach can heavily burden the processor and other system resources, since restarting a process can require significant overhead. In extreme cases, high system overhead may have caused the fault in the first place, and the prior art restart mechanisms which further load the system only serve to compound the problem.
Furthermore, by not determining the cause of the fault, prior art systems that simply restart faulted processes may end up creating lengthy, and possibly "infinite," process restarting loops. Such loops may over-utilize system resources such as processor bandwidth and memory in attempts to endlessly rejuvenate a failed process that may be failing due to an external event beyond the control of the process.
Those prior art systems that attempt restarts with only a limited number of allowed restarts may avoid the problem of endless process restarting loops, but still suffer from system over-utilization during the restart period.
In contrast, the present invention provides a unique approach to fault management. In this invention, fault conditions related to processes can be handled passively, actively, or both passively and actively through the use of unique process restart sequences. Passively handling faults, called passive fault management, comprises detecting faults and waiting for a period of time for the conditions that led to the fault to change and the fault to correct itself. On the other hand, active fault management attempts to determine the cause of the fault and to remedy the situation, thereby preventing future faults.
In this invention, the process restart sequences allow restarting of a failed process according to a sequence or schedule that manages the loads placed on system resources during failure and restarting conditions while maintaining the utmost availability of the process. In real-time or mission critical environments, such as in data communications networking devices or applications, the invention provides significant advancements in fault management.
More specifically, embodiments of the present invention relate to systems, methods and apparatus for handling processing faults in a computer system. According to a general embodiment of the invention, a system provides a method of detecting a fault condition which causes improper execution of a set of instructions. The system then determines a period of time to wait in response to detecting the fault condition and waits the period of time in an attempt to allow the fault condition to be minimized. This is an example of passive fault management. The system then initiates execution of the set of instructions after waiting the period of time. The system then repeats the operations of detecting, determining, waiting and initiating. Preferably, each repeated operation of determining a period of time determines successively longer periods of time to wait. Accordingly, this embodiment of the invention provides a passive process restart back-off mechanism that allows restarting of processes in a more controlled and time-spaced manner which conserves system resources and reduces peak processing loads while at the same time attempting to maintain process availability.
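The detect, determine, wait and initiate cycle described above can be sketched roughly as follows. This is only an illustrative sketch, not the claimed implementation: the names `backoff_delays` and `run_with_backoff`, the doubling schedule, the one-second starting delay, and the `max_restarts` cap are all assumptions introduced for the example.

```python
import time
from typing import Callable, Iterator, List


def backoff_delays(initial: float = 1.0, factor: float = 2.0) -> Iterator[float]:
    """Yield successively longer wait periods (hypothetical doubling schedule)."""
    delay = initial
    while True:
        yield delay
        delay *= factor


def run_with_backoff(task: Callable[[], bool],
                     max_restarts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Run task and restart it after successively longer waits.

    task returns True on proper execution and False when a fault is detected.
    max_restarts is an assumed safety cap, not something the text requires.
    Returns the list of wait periods that were applied.
    """
    waits: List[float] = []
    delays = backoff_delays()
    ok = task()                      # initial execution; False means a fault was detected
    while not ok and len(waits) < max_restarts:
        wait = next(delays)          # determine the period of time to wait
        sleep(wait)                  # wait so the fault condition has a chance to ease
        waits.append(wait)
        ok = task()                  # initiate execution again, then repeat if it faults
    return waits
```

Passing a no-op `sleep` lets the schedule be inspected without real delays; in a device, the wait would more likely be driven by a timer in the operating system than by a blocking sleep.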
Preferably, the system is implemented on a computer controlled device, such as a data communications device. The device includes a processor, an input mechanism, an output mechanism, a memory/storage mechanism and an interconnection mechanism coupling the processor, the input mechanism, the output mechanism, and the memory/storage mechanism. The memory/storage mechanism maintains a process restarter. The process restarter is preferably a process or program that executes as part of, or in conjunction with, the operating system of the device. The invention is preferably implemented in an operating system such as the Cisco Internetworking Operating System (IOS), manufactured by Cisco Systems, Inc., of San Jose, Calif.
The process restarter executes in conjunction with the processor and detects improper execution of a set of instructions on the processor and re-initiates execution of the same set of instructions in response to detecting improper execution. The process restarter also repeatedly performs the detecting and initiating operations according to a first restart sequence, and repeatedly performs the detecting and initiating operations according to a second restart sequence. The second restart sequence causes the process restarter to initiate execution of the set of instructions in a different sequence than the first restart sequence. In the second restart sequence, the process restarter performs the operation of detecting, and then waits for expiration of a restart interval before performing the operation of initiating.
Each restart interval between successive repetitions of the second restart sequence becomes progressively longer in duration. Also, depending upon the embodiment, each restart interval can be computed from a formula based on at least one of a geometric, an exponential, a logarithmic, an incremental, a progressive, a linear, an increasing, a decreasing and a random pattern.
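As a rough illustration of how a few of these patterns might be turned into formulas, the sketch below computes the n-th restart interval for a named pattern. The function name `restart_interval`, the `base` parameter, and the specific formulas chosen for each pattern are assumptions made for the example; the text names the patterns but does not fix their formulas.

```python
import math
import random


def restart_interval(pattern: str, n: int, base: float = 1.0, rng=None) -> float:
    """Return the n-th restart interval in seconds (n = 1, 2, 3, ...).

    The formula used for each pattern is a hypothetical choice for illustration.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if pattern == "linear":
        return base * n                      # 1, 2, 3, 4, ...
    if pattern == "geometric":
        return base * 2 ** (n - 1)           # 1, 2, 4, 8, ...
    if pattern == "exponential":
        return base * math.exp(n - 1)        # 1, e, e^2, ...
    if pattern == "logarithmic":
        return base * math.log(n + 1)        # grows, but ever more slowly
    if pattern == "random":
        # Uniform draw whose upper bound grows with n.
        return (rng or random).uniform(0, base * n)
    raise ValueError(f"unknown pattern: {pattern}")
```

A geometric or exponential pattern spaces restarts out fastest, while a logarithmic pattern keeps retrying comparatively often; which is appropriate would depend on the application of the device.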
The computer controlled device can also include a helper process that resides in the memory/storage mechanism and executes in conjunction with the processor. The helper process executes, preferably, during the expiration period of the restart interval during the second restart sequence in order to diagnose and correct at least one fault condition causing the improper execution of the set of instructions detected by the process restarter.
According to another embodiment, a method is provided which detects improper execution of a set of instructions. The set of instructions may be a process or program which is executing (or is interpreted) on a device, or may be a routine, sub-routine, procedure, code, thread, macro, interpreted series of statements, and so forth. The system of the invention initiates execution of the set of instructions (e.g., restarts the process) in response to detecting the improper execution. Then the system repeats the steps of detecting and initiating according to a first restart sequence. The first restart sequence defines the time sequencing and the number of times that the detecting and initiating operations are performed, and preferably performs the operation of initiating execution of the set of instructions immediately or shortly after, and in response to, the step of detecting the improper execution. In this manner, the first restart sequence attempts to restart the process as quickly as possible for a limited number of detected failures, and then enters into the second restart sequence.
According to one aspect of the first restart sequence, the system detects a fault condition associated with the set of instructions and determines if the fault condition exceeds a maximum number of fault conditions associated with the first restart sequence. If the maximum number has not been exceeded, the system initiates execution of the set of instructions, such that upon each successive step or operation of detecting and initiating according to the first restart sequence, execution of the set of instructions is quickly initiated (e.g. restarted). The restarting can be done immediately, or a short amount of time can be provided to allow the operating system or failed process to be cleaned up (i.e., processing resources freed and memory released).
After the first restart sequence has completed its attempt to restart the failed process quickly after each failure, if the improper execution of the set of instructions (i.e., the process failure) continues to be detected, the system repeats the operations of detecting and initiating according to a second restart sequence.
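The hand-off from the first restart sequence to the second can be sketched as a simple fault counter. The names `RestartState` and `on_fault`, the returned action strings, and the default limit of three quick restarts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RestartState:
    """Per-process bookkeeping for the first restart sequence (hypothetical)."""
    fault_count: int = 0
    max_quick_restarts: int = 3   # assumed maximum for the first restart sequence


def on_fault(state: RestartState) -> str:
    """Record a detected fault and decide which restart sequence applies."""
    state.fault_count += 1
    if state.fault_count <= state.max_quick_restarts:
        # Maximum not exceeded: restart immediately or after brief cleanup.
        return "restart_now"
    # Faults keep occurring: continue under the second restart sequence.
    return "enter_second_sequence"
```

For example, with `max_quick_restarts=2`, the first two faults produce `"restart_now"` and the third produces `"enter_second_sequence"`.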
The second restart sequence initiates execution of the set of instructions in a different sequence than the first restart sequence. In one embodiment, the second restart sequence performs each step of initiating in response to the step of detecting after expiration of a current time interval that is different than a former time interval of a former repetition of the second restart sequence. In another embodiment, the current time interval is greater than the former time interval, such that each repetition of the second restart sequence initiates execution of the set of instructions after waiting progressively longer time intervals in response to the step of detecting.
Unlike the first restart sequence that provides quick or immediate restarts, the second restart sequence provides different time intervals after each failure detection in order for the fault to be remedied before again reinitiating the set of instructions that form the process. The time intervals (restart intervals) preferably get larger as failures progress, though the invention can include embodiments in which the time intervals get smaller or are simply random or selected by another mechanism which makes two successive time intervals different from one another. The embodiments summarized thus far that use the second restart sequence are also examples of passive fault management, in that the second restart sequence does not, in these embodiments, proactively attempt to correct the failure, other than by restarting the process after varying delay or restart intervals.
Other embodiments of the invention provide that the delays between process restarts in the second restart sequence can be user or device administrator programmable or can be specified from information provided by the process (which is being successively restarted). The second restart sequence can also base process restart intervals (the time between failure and restart, or between a former restart and a current restart) on a mathematical time-based algorithm system that includes one or more of the following progressions for the restart intervals: geometric, exponential, logarithmic, incremental, progressive, linear, increasing, decreasing or random or may use another formula-based system to determine the restart interval.
According to a more specific embodiment of the invention, the second restart sequence performed by the system of the invention determines a runtime for the set of instructions. Then the system determines a next restart interval based on the runtime for the set of instructions. The step of initiating execution of the set of instructions is then performed after expiration of the next restart interval and upon each repetition of the second restart sequence, the next restart interval is different.
The next restart interval determined in each successive repetition of the second restart sequence can be progressively longer in duration than a next restart interval determined in a former repetition of the second restart sequence. To determine a next restart interval, the system in one embodiment determines if the runtime for the set of instructions is less than a current restart interval, and if so, advances the next restart interval based on the current restart interval. In another embodiment, the system uses the runtime for the set of instructions to select a next restart interval from a set of next restart intervals associated with the second restart sequence. The restart intervals may be stored in a table, list, or other file or data structure, or may be provided by the process or may be programmed by a device administrator, or may be calculated during processing.
To end the second restart sequence, the system in one embodiment determines if the runtime for the set of instructions exceeded a current restart interval, and if so, initiates execution of the set of instructions and terminates the second restart sequence. This aspect of the invention assumes, for instance, that if a process can successfully execute for a period of time that exceeds the current restart interval, then the former failure condition has disappeared and some new event is most likely causing the process to fail. As such, this new failure condition should be handled by returning to use of the first restart sequence followed by the second.
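The runtime comparison that advances or ends the second restart sequence might look like the sketch below. The interval table `INTERVALS`, the function name `next_step`, the use of `None` to signal termination, and the treatment of a runtime exactly equal to the current interval (handled here as a continued failure) are assumptions for illustration.

```python
from typing import Optional, Sequence

# Hypothetical table of restart intervals, in seconds.
INTERVALS = (1.0, 2.0, 4.0, 8.0)


def next_step(runtime: float, index: int,
              intervals: Sequence[float] = INTERVALS) -> Optional[int]:
    """Decide what follows a fault detected during the second restart sequence.

    runtime is how long the process ran before faulting, and index selects
    the current restart interval. Returns the index of the next restart
    interval, or None when the second restart sequence should terminate.
    """
    current = intervals[index]
    if runtime > current:
        # The process outlived the current interval, so the earlier fault
        # condition has probably cleared: restart and leave the second
        # sequence, handling any new fault with the first sequence again.
        return None
    # The process failed again quickly: advance to the next (longer)
    # interval, staying at the last table entry once it is reached.
    return min(index + 1, len(intervals) - 1)
```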
According to the various embodiments of the invention, the operation of determining the time between the steps of detecting and initiating in either or both the first and second restart sequences can be programmable. Also, the operation of initiating in the second restart sequence can be performed at an elapsed time interval measured from a former step of detecting, or can be performed at an elapsed time interval measured from a former step of initiating.
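The two reference points for measuring the elapsed interval can be illustrated with a small helper. The function name `next_restart_time` and the `measure_from` parameter are hypothetical; timestamps are assumed to be seconds on a common clock.

```python
def next_restart_time(interval: float,
                      last_detect: float,
                      last_initiate: float,
                      measure_from: str = "detect") -> float:
    """Return the time at which the next restart should be initiated."""
    if measure_from == "detect":
        # Interval measured from the former step of detecting the fault.
        return last_detect + interval
    if measure_from == "initiate":
        # Interval measured from the former step of initiating the process.
        return last_initiate + interval
    raise ValueError("measure_from must be 'detect' or 'initiate'")
```

Measuring from the former initiation includes the time the process actually ran, so a process that ran for most of the interval before failing is restarted sooner than it would be when measuring from detection.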
The operation of detecting improper execution of a set of instructions in embodiments of the invention can also detect a fault due to a resource required by the set of instructions. For example, the fault may be due to a hung, crashed, or failed process required by the set of instructions. Alternatively, the fault may be simply that a correctly executing process is so busy or is so over-utilized that it cannot adequately supply its functionality to, for example, other processes or system resources which rely on the process. Thus the definition of fault and improper execution of a set of instructions in this invention is relative to the application of the device and may range from a process being very busy or sluggish to respond to the process actually functioning improperly or not at all or no longer existing within the device.
In embodiments which use active fault management, an operation is included in the system of the invention which initiates execution of a set of helper instructions (e.g., one or more helper processes) in response to the step of detecting improper execution of a set of instructions during the second restart sequence. The set of helper instructions performs functions to assist in the handling of processing faults in the computer system. There may be one or many sets of helper instructions and each set may execute as a separate helper process or within a single large helper process.
The helper processes are designed to assist in fault diagnosis and correction, and are an example of active fault management. Some helper processes may merely perform passive diagnostic actions, while others may be proactive and can seek-out and remedy fault conditions causing the improper execution of instructions. In the case of multiple helper processes, the set of helper instructions can be selected based upon the next restart interval, where each interval has a specific associated helper process. The helper processes can vary in their level of robustness or processing capabilities with respect to how detailed their fault analysis and correction capabilities can extend.
The invention also provides an embodiment which includes a method for fault management in a computer controlled device. This method embodiment includes the steps of detecting a fault condition associated with a process and determining a runtime for the process. A step of determining a restart interval based on the runtime for the process is also included. Also, a step is provided of executing a helper process associated with the restart interval to diagnose and remedy the fault condition associated with the process. The helper process preferably executes during expiration of the restart interval. The embodiment also includes a step of initiating execution of the process after expiration of the restart interval.
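The steps of this method embodiment can be sketched as a single function. Everything named here is an assumption for illustration: `manage_fault`, the rule that doubles the interval when the process failed before the current interval elapsed, the `helpers` mapping from interval to helper, and the injected `sleep` and `clock` callables. The helper runs synchronously inside the interval in this sketch; a real device would more likely run it as a separate process.

```python
import time
from typing import Callable, Dict


def manage_fault(runtime: float,
                 current_interval: float,
                 helpers: Dict[float, Callable[[], None]],
                 restart: Callable[[], None],
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> float:
    """Handle one detected fault and return the restart interval that was used."""
    # Determine the restart interval based on the runtime (hypothetical rule:
    # double the interval if the process failed before the interval elapsed).
    if runtime < current_interval:
        interval = current_interval * 2
    else:
        interval = current_interval

    started = clock()
    helper = helpers.get(interval)       # helper associated with this interval
    if helper is not None:
        helper()                         # diagnose and remedy during the interval

    remaining = interval - (clock() - started)
    if remaining > 0:
        sleep(remaining)                 # let the rest of the interval expire

    restart()                            # initiate execution of the process
    return interval
```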
Another embodiment comprises a computer program product having a computer-readable medium including computer program logic encoded thereon for controlling faults in a computer controlled device, such that the computer program logic, when executed on at least one processing unit within the computer controlled device, causes the at least one processing unit to perform the steps of the method embodiments described herein.
Yet another embodiment provides a process control block data structure maintained in a computer readable medium, such as in memory or on a disk. An example of such a process control block data structure is one that is maintained by an operating system. The process control block data structure maintains information about an instantiation of a process and information about at least one restart pattern used to reinitiate the process in the event of a failure of the process.
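One possible shape for such a process control block is sketched below. The field names (`pid`, `name`, `state`, `restart_count`, `restart_patterns`) and the `RestartPattern` type are hypothetical; the text only requires that the structure hold information about a process instantiation and about at least one restart pattern.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RestartPattern:
    """A named schedule of restart intervals, in seconds (hypothetical layout)."""
    name: str
    intervals: List[float]


@dataclass
class ProcessControlBlock:
    """Information about one process instantiation and its restart patterns."""
    pid: int
    name: str
    state: str = "running"
    restart_count: int = 0
    restart_patterns: List[RestartPattern] = field(default_factory=list)


# Example: a process with a quick-restart pattern followed by a back-off pattern.
pcb = ProcessControlBlock(
    pid=42,
    name="routing",
    restart_patterns=[
        RestartPattern("quick", [0.0, 0.0, 0.0]),
        RestartPattern("backoff", [1.0, 2.0, 4.0, 8.0]),
    ],
)
```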
A propagated signal embodiment provides a propagated signal that contains a process control block data structure. The process control block data structure maintains information about an instantiation of a process and information about at least one restart pattern used to reinitiate the process in the event of a failure of the process.
Through the use of the above embodiments, the invention can precisely control process restarting sequences in the event of process failures. Using the first and second restart sequences, the invention can minimize downtime of a process. The invention also helps ensure correct overall system and device operation by proactively managing and controlling system resources in response to processing faults. By controlling process re-instantiation, propagation of fault conditions is controlled which helps to avoid recursive processing failures. The invention also helps prevent overuse of system resources during process re-instantiation and termination by spacing process restarts out over time. The time intervals between process restarts can be programmable which allows a user or device administrator to define specific sequences and offers flexibility that can be made dependent upon the application of the specific device using the invention.