The term “upgrading a software application” generally refers to the process of replacing an existing version of the software application with a newer version, adding a new version of the software application where none previously existed, or otherwise changing an existing version of the software application to a newer, different version. A software upgrade may be performed for various reasons, such as to add, remove, or modify one or more features in an existing version of the software, to remove bugs or errors, or to improve the efficiency of the software. An upgrade is generally performed to enhance the performance of a software application.
Many modern computing environments include a framework of multiple heterogeneous software applications, which may be developed by different third-party entities. Each software application may include zero or more plugins. The plugins may include software components that add a new utility or feature to a software application or enhance its existing utilities or features. The applications may execute on or be hosted by multiple hosts in a distributed environment, with each host potentially hosting multiple applications. Performing an upgrade operation in such a heterogeneous distributed environment involves executing multiple upgrade processes, possibly concurrently, on multiple hosts to upgrade the applications hosted by those hosts. The overall upgrade operation performed in such an environment is further complicated by possible dependencies among the upgrade processes that make up the overall upgrade operation.
Given the complexity of performing software upgrades in a heterogeneous distributed computing environment, it is very difficult to determine when something has gone wrong. For example, it is very difficult to determine if and when a particular upgrade process has stopped functioning properly, such as when the upgrade process has frozen and entered a hang state. An upgrade process may be considered to have entered a “hang state” when it has frozen execution before completing and is no longer able to resume normal operation from its frozen state. An upgrade process in a hang state may not even respond to any inputs. Due to potential dependencies between the various upgrade processes that may be executed as part of the overall upgrade operation, the hanging of a first upgrade process may in turn cause a second upgrade process to hang, and so on. This may result in a chain reaction that causes multiple upgrade processes to hang or freeze and may even cause the entire upgrade operation to enter a state in which no upgrade processes or activities can be continued or carried out.
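The chain reaction described above can be illustrated with a minimal sketch. It assumes a hypothetical dependency map in which each upgrade process lists the processes that wait on it; the process names and the function `blocked_by_hang` are illustrative only and do not come from any particular upgrade system.

```python
from collections import deque

# Hypothetical dependency map: each key is an upgrade process, and its
# value lists the upgrade processes that wait on it before proceeding.
DEPENDENTS = {
    "db_upgrade": ["app_upgrade"],
    "app_upgrade": ["plugin_upgrade"],
    "cache_upgrade": [],
}


def blocked_by_hang(dependents, hung):
    """Return the set of processes that cannot proceed once `hung` hangs.

    Performs a breadth-first walk over the dependency map, collecting
    every process that directly or transitively waits on `hung`.
    """
    blocked = set()
    queue = deque([hung])
    while queue:
        current = queue.popleft()
        for waiter in dependents.get(current, ()):
            if waiter != hung and waiter not in blocked:
                blocked.add(waiter)
                queue.append(waiter)
    return blocked


if __name__ == "__main__":
    print(sorted(blocked_by_hang(DEPENDENTS, "db_upgrade")))
```

In this sketch, a hang in the hypothetical `db_upgrade` process stalls both `app_upgrade` and `plugin_upgrade`, while a hang in `cache_upgrade` stalls nothing else.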
Conventionally, a hung upgrade process is detected manually, typically when an operator notices that the upgrade process has been executing much longer than expected. The operator typically has to manually examine and analyze data generated by the upgrade process, such as log files, to determine or confirm whether the upgrade process is indeed in a hang state or is just taking longer to complete. This detection may not occur until long after the upgrade process has transitioned into a hang state, and consequently corrective actions for handling the hang scenario (e.g., killing the upgrade process and restarting it) may not be initiated until an inordinate amount of time has been wasted.
Some conventional systems include diagnostic tools that try to detect a hung thread from among multiple threads in a single multithreaded process, where the thread executes in the process's execution environment. The detection is thus limited to a thread within a single process. Moreover, such tools use a single pre-defined threshold configured for the tool and cannot be customized for different upgrade processes on different heterogeneous hosts. Also, in such systems, the single pre-defined threshold is set to a high value to avoid false alarms or false positives (i.e., to avoid indicating a hang when in reality no hang exists). This high threshold value severely limits the usefulness of such tools, since many hang scenarios cannot be detected until the high threshold value has been reached or exceeded, which may be long after the hang occurred. Consequently, such diagnostic tools cannot be used in heterogeneous computing environments where a large number of upgrade processes are being executed, possibly many in parallel, on hosts of differing capabilities.
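The limitation of a single high threshold can be made concrete with a short sketch. All names and numbers here are assumptions chosen for illustration: two hypothetical upgrade processes with different expected durations, and one shared threshold set above the slower of the two to avoid false positives.

```python
# Hypothetical single threshold shared by all upgrade processes, set
# above the slowest expected duration to avoid false positives.
SINGLE_THRESHOLD_SECONDS = 3600

# Hypothetical expected durations for two upgrade processes.
EXPECTED_SECONDS = {"small_patch": 60, "schema_migration": 3000}


def is_flagged(elapsed_seconds, threshold_seconds=SINGLE_THRESHOLD_SECONDS):
    """Report a hang only once elapsed time exceeds the threshold."""
    return elapsed_seconds > threshold_seconds


def worst_case_detection_delay(process, threshold_seconds=SINGLE_THRESHOLD_SECONDS):
    """Seconds between a process's expected completion time and the
    earliest moment the single threshold could flag it as hung."""
    return max(0, threshold_seconds - EXPECTED_SECONDS[process])


if __name__ == "__main__":
    # small_patch has run ten times longer than expected, yet is not flagged.
    print(is_flagged(600))
    print(worst_case_detection_delay("small_patch"))
```

Under these assumed values, the hypothetical `small_patch` process could be hung for nearly an hour past its expected completion before the single threshold flags it, whereas lowering the threshold to catch it sooner would falsely flag a healthy `schema_migration`.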