The present disclosure relates generally to a multisystem cluster, and, in particular, to a system and method for detection of signaling sympathy sickness in a multisystem Sysplex®, and a system and method for policy directed resolution of such sympathy sickness.
In a multisystem Sysplex®, if a system is not processing work in a timely manner, it can cause “sympathy sickness” on other systems in the Sysplex®. Sympathy sickness can surface in a variety of ways, but the symptoms typically include hangs in which work performed on the systems does not make progress, work queues that grow longer and longer, and processes that terminate due to timeouts. When these types of conditions occur, it can be difficult to correctly diagnose the root cause as being a problem on some other system. Indeed, some of the tools used to investigate the symptoms (such as operator display commands) may be rendered useless because they themselves are impacted by the sympathy sickness problem. Because the problem cannot be diagnosed easily, operators are susceptible to taking inappropriate actions. For example, they may take down the system that is suffering sympathy sickness as opposed to the system that is causing the sympathy sickness.
Some installations implementing Sysplex® can experience this type of sympathy sickness due to cross-system coupling facility (XCF) signaling. Typically, a multisystem Sysplex® includes components to monitor and report members that are not processing signals in a timely manner. For example, the XCF component of z/OS includes code to monitor and report (via messages to the operator and to log files) when it finds that a member of an XCF group is not processing signals in a timely manner. A member in this state is called a “stalled member”. In some cases, the fact that the member is stalled has no apparent impact on the Sysplex®. In other cases, the stall condition persists long enough to consume all available signaling resources so that no new signals can be received across the signaling paths. Signals then back up on the sending system(s), which can cause delayed signal delivery and outright rejection of new signaling requests. These signal delays and/or signal rejects, in turn, impact the exploiting applications and subsystems, causing sympathy sickness. But from the “member is stalled” messages, one cannot typically discern whether the condition is in fact causing sympathy sickness. In addition, even if there does appear to be sympathy sickness (hangs, etc.) on some other system, it may not be possible to discern whether the stalled member is actually causing the sympathy sickness.
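By way of illustration only, the back-up described above can be sketched as a small simulation. The sketch below is not XCF code and does not use any z/OS interface; the class names, the buffer count, and the queue limit are all hypothetical values chosen to show how a member that stops processing signals can exhaust inbound buffers on its own system and thereby cause signal delays and rejects on a sending system.

```python
from collections import deque


class Receiver:
    """Hypothetical receiving system with a fixed pool of inbound signal buffers."""

    def __init__(self, buffers):
        self.buffers = buffers
        self.inbound = deque()

    def has_free_buffer(self):
        return len(self.inbound) < self.buffers

    def accept(self, signal):
        self.inbound.append(signal)

    def process_one(self):
        # A healthy member calls this regularly; a stalled member never does.
        if self.inbound:
            self.inbound.popleft()


class Sender:
    """Hypothetical sending system with a bounded outbound signal queue."""

    def __init__(self, queue_limit):
        self.queue_limit = queue_limit
        self.outbound = deque()
        self.rejected = 0

    def drain(self, receiver):
        # Transfer queued signals only while the receiver has a free buffer.
        while self.outbound and receiver.has_free_buffer():
            receiver.accept(self.outbound.popleft())

    def send(self, signal, receiver):
        self.drain(receiver)
        if len(self.outbound) >= self.queue_limit:
            self.rejected += 1  # outright rejection of a new signaling request
            return False
        self.outbound.append(signal)  # may sit here, delayed, if buffers are full
        self.drain(receiver)
        return True


def simulate(ticks, member_stalled, buffers=4, queue_limit=3):
    """Return (inbound buffers in use, signals backed up on sender, rejects)."""
    receiver = Receiver(buffers)
    sender = Sender(queue_limit)
    for tick in range(ticks):
        if not member_stalled:
            receiver.process_one()
        sender.send(tick, receiver)
    return len(receiver.inbound), len(sender.outbound), sender.rejected


if __name__ == "__main__":
    print("healthy member:", simulate(10, member_stalled=False))
    print("stalled member:", simulate(10, member_stalled=True))
```

In this sketch, the sending system has no fault of its own; its queue grows and its requests are rejected solely because the member on the receiving system is not processing signals, which is the pattern referred to above as sympathy sickness.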
Since XCF issues messages to indicate that a member is stalled, it is possible to use automation to react to these messages. For example, one might try to discover the reason for the member being stalled and take an appropriate action to remedy the condition. However, one problem with relying on the interception of these messages is that the component of z/OS responsible for processing messages itself relies on signals, and so may be impacted by sympathy sickness, or may suffer from other local problems, that prevent it from processing messages. Thus, the system automation may never see the message to which it is to react. Another problem with relying on these messages is that the stalled member is not necessarily causing a sympathy sickness impact. Thus, one might take action needlessly or erroneously. If there is no automation to take action, there is an exposure that the operator will fail to notice the messages (assuming they do get presented) or fail to understand them, and thereby fail to take action to remedy the stalled member. In addition, one can only guess as to whether any action taken against the stalled member will solve the problem, because one likely cannot definitively associate the sympathy sickness with the stalled member.