1. Field of the Disclosure
The present disclosure concerns managing computer operations. More particularly, the present disclosure concerns a dynamic, milestone-based solution for managing computer operations.
2. Description of the Related Art
As computer technologies continue to rapidly advance, users exhibit less and less tolerance for process delays. Where at one point in time users would not bat an eye at waiting five minutes to download a digital audio file or ten minutes to configure a server, most users today would be hard pressed to tolerate a fraction of that delay. Given that user patience has become such a limited commodity in the digital world, it is vital that any extended wait imposed on a user ultimately leads to a successful outcome. No user wants to endure an extended wait for a process to complete only to have the process fail and require another attempt. This principle is familiar even to individuals outside the computer realm. A restaurant customer, for example, may not mind waiting twenty minutes for a cooked-to-order steak. But that same customer will likely be very annoyed, and indeed may not return to the restaurant at all, if the restaurant causes him to wait the full twenty minutes before informing him that it has run out of steak.
Turning back to the computer realm, a device user will typically be much more patient with an operational delay if the operation is making progress and will ultimately succeed. Conversely, that same user will likely be annoyed if the operation is not making progress or is in an environment where it cannot succeed. For example, while attempting to configure a device in the context of a runtime network environment, a user may configure various device parameters through a device configuration interface. In order for the configuration to take effect, however, it may require a validation step. Obtaining the requisite validation may involve communication with configuration servers. In such an example, process delays may arise as the servers in the network environment attempt to read large or complex configurations. The user will typically consider such delays bearable because they eventually provide value in the form of a desired configuration. Delays may also arise, however, from a device attempting to contact a server that is not present in the network. The user will typically extend far less grace in the latter scenario because at the end of the process the user obtains no useful result. Consequently, whether users consciously recognize it or not, they prefer an operation to fail quickly if in fact it is going to fail. They are only willing to endure process delays when the process is ultimately going to succeed. The problem, however, is that it is difficult to determine in advance whether an operation will fail without first waiting for it to fail.
In one illustrative network scenario, a user can reasonably expect that a Transmission Control Protocol (TCP) connection to a server should be established in 10 seconds or less. A user can reasonably expect up to a 30-second wait as the server processes a configuration query and starts returning results. And a user can reasonably expect to wait several minutes before receiving the full results.
In such cases, a simple 30-second timeout may be too long for the TCP connection step. Users are not willing to wait 30 seconds for a connection to fail when their expectation is that it will succeed in 10 seconds or less. And yet, at the same time, a 30-second timeout may be too short when the network devices are attempting to exchange significant amounts of data over a slow network. Such scenarios result in artificial failures in which the result would have been achieved had the timeout not killed the operation early. On top of wasting valuable time, such scenarios can prevent the network environment from functioning properly. Thus, a “catch-22” exists. Reducing the timeout shortens the delay experienced by the user, but increases the risk of inducing artificial failures. Increasing the timeout reduces the risk of inducing artificial failures (i.e., by giving the operation more time to succeed), but lengthens the delay experienced by the user without providing any guarantee of success.
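By way of illustration only, the following Python sketch compares a single timeout value against the per-phase expectations described above. The 10-second and 30-second bounds come from the scenario above; the 180-second bound is an assumed stand-in for "several minutes," and the names `Phase`, `PHASES`, `classify`, and `evaluate` are hypothetical. The sketch merely restates the dilemma and is not a description of any particular implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    expected_max_s: float  # longest wait a user reasonably expects for this phase


# 10 s and 30 s are taken from the scenario in the text.
# 180 s is an ASSUMED value standing in for "several minutes".
PHASES = [
    Phase("tcp_connect", 10.0),
    Phase("first_results", 30.0),
    Phase("full_results", 180.0),
]


def classify(timeout_s, phase):
    """Compare one timeout value against one phase's expected bound."""
    if timeout_s > phase.expected_max_s:
        # A failure in this phase is reported later than the user expects.
        return "too long"
    if timeout_s < phase.expected_max_s:
        # A slow but healthy phase may be killed early (artificial failure).
        return "too short"
    return "matched"


def evaluate(timeout_s, phases=PHASES):
    """Classify a single timeout value against every phase."""
    return {p.name: classify(timeout_s, p) for p in phases}


if __name__ == "__main__":
    for candidate in (10.0, 30.0, 180.0):
        print(candidate, evaluate(candidate))
```

Under these assumptions, no single value is classified as "matched" for all three phases, which is the catch-22 described above.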
Previously attempted solutions have failed to adequately address the problem. One such solution involves implementing a regular tick-based watchdog. With a regular tick-based watchdog, the operation will regularly indicate that it has not deadlocked or crashed. But that solution is not suitable because it does not guarantee that the operation will make any progress. Even if the watchdog is reset only at progress points, the watchdog does not account for the fact that different phases require timeouts of different lengths. Enabling a regular tick-based watchdog to do so would require modifying the program itself, which may affect certification of correct behaviors, violate copyright law, or be very difficult to achieve.
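The limitation of a regular tick-based watchdog can be sketched as follows, for illustration only. The class and function names (`TickWatchdog`, `FakeClock`, `run_stalled_operation`) are hypothetical, and a manually advanced clock is assumed so that the behavior is deterministic. The simulated operation stays alive and resets the watchdog every second but never advances its progress counter, so the watchdog never expires.

```python
class FakeClock:
    """Manually advanced clock (an assumption made for deterministic illustration)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class TickWatchdog:
    """Watchdog with one fixed interval, reset by a periodic 'still alive' tick."""

    def __init__(self, interval_s, clock):
        self._interval_s = interval_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self):
        # Records only that the operation is alive, not that it made progress.
        self._last_kick = self._clock()

    def expired(self):
        return self._clock() - self._last_kick > self._interval_s


def run_stalled_operation(watchdog, clock, ticks):
    """Simulate an operation that is alive but makes no progress."""
    progress = 0  # never incremented: the operation is stalled
    for _ in range(ticks):
        clock.advance(1.0)
        watchdog.kick()
        if watchdog.expired():
            return "expired", progress
    return "still running", progress


if __name__ == "__main__":
    clock = FakeClock()
    watchdog = TickWatchdog(interval_s=5.0, clock=clock)
    print(run_stalled_operation(watchdog, clock, ticks=600))
```

In this sketch the single `interval_s` value also applies to every phase of the operation, so the watchdog cannot distinguish a phase that should finish in seconds from one that may legitimately take minutes.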
Another inadequate attempt to solve the problem involves providing continual monitoring of the existence of network devices and avoiding attempts to perform the operations when the network devices appear absent. That solution is sub-optimal because it requires continual generation of monitoring traffic and may fail to catch cases where network devices exist but particular services are not responsive.
Yet another inadequate attempt to solve the problem, one specific to the configuration environment, involves providing a cached result from a previous configuration query. But that solution is also sub-optimal and is particularly ill-suited to a configuration environment because cached results are unreliable and can be misleading. For instance, when a user is testing a recent configuration change, the cached result will not reflect the change. At any point an administrator might access a main controller and delete or modify configuration records. In the configuration context, it is critical that records are current (e.g., a record indicating whether or not a given device has a valid domain membership). Relying on potentially out-of-date cached results is an unreliable approach.
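The staleness problem can be sketched as follows, for illustration only. The `Controller` and `CachedClient` classes and the `device-1` record are hypothetical and stand in for a main controller and a device that caches query results; no real configuration service is modeled.

```python
class Controller:
    """Hypothetical main controller holding domain-membership records."""

    def __init__(self):
        self.records = {"device-1": True}

    def query(self, device):
        return self.records.get(device, False)


class CachedClient:
    """Hypothetical client that answers from a cache after the first query."""

    def __init__(self, controller):
        self._controller = controller
        self._cache = {}

    def is_domain_member(self, device):
        if device not in self._cache:
            self._cache[device] = self._controller.query(device)
        return self._cache[device]


if __name__ == "__main__":
    controller = Controller()
    client = CachedClient(controller)

    print(client.is_domain_member("device-1"))  # first call queries the controller

    # An administrator deletes the record on the controller.
    del controller.records["device-1"]

    print(controller.query("device-1"))         # current record on the controller
    print(client.is_domain_member("device-1"))  # cached answer, no longer current
```

After the record is deleted, the controller and the cached client disagree about the same device, which is the situation described above.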
Further, because of the manner in which library routines and commands used in scripts encapsulate behavior, it can be difficult, if not impossible, to merely insert watchdog timers at relevant points in an operation. Retroactively integrating appropriate watchdog timers is complicated, tedious, and error-prone because it requires decomposing an operation into smaller parts, which risks introducing bugs.
Thus, there is a persistent need in the art for an improved method of managing computer operations.