The invention relates to apparatus and accompanying methods for use preferably in a multi-system shared data (sysplex) environment for quickly and efficiently isolating (fencing), through a pre-defined hierarchical order, failed sysplex components from accessing shared data.
Prior to the early 1980s, large scale computing installations often relied on using a single monolithic computer system to handle an entire processing workload. If the system failed, all processing applications in the workload were suspended until the failure was remedied. While a resulting processing delay was tolerated at first, as increasingly critical applications were processed through the system, any such ensuing delays became increasingly intolerable. Furthermore, as processing needs increased, the entire system was eventually replaced with a new one of sufficient capacity. Replacing systems in that manner proved to be extremely expensive and very inefficient. However, at that time, few workable alternatives to monolithic systems existed that appreciably eliminated both these outages and the eventual need to replace the entire system.
To efficiently address this need, over the past several years and continuing to the present, computer manufacturers have been providing processing architectures based on a multi-system shared data approach. Through these architectures, multiple large scale computer systems, each of which is often referred to as a computer processing complex (CPC) or a central electronic complex (CEC), are inter-connected, through, for example, a coupling facility or other inter-processor communication mechanism, to permit each such system to gain read-write access to data residing on one or more shared input/output devices, such as a direct access storage device (DASD). The resulting inter-connected computer system is commonly referred to as a "sysplex". In a sysplex, a processing workload is distributed, e.g. in a balanced fashion, among all of the inter-connected computer systems such that each computer system is responsible for processing a portion, e.g., an approximately equal portion, of the entire workload. Each of these systems executes its own portion independently of the other such systems. Generally, separate copies (instances) of an application are resident and active on more than one of the computer systems and, based upon, e.g., the processing capacity required of the application, often on all such systems. By virtue of having shared data access, if one computer system in the sysplex fails, its particular workload can be quickly and readily taken over by another such system without interrupting application processing--as would otherwise occur in a single monolithic system. Hence, each computer system in the sysplex is sized to provide sufficient additional processing capacity, for use during a failure condition, to accommodate the processing load ordinarily handled by at least one other such system.
Moreover, the processing capacity of the sysplex can be readily expanded by simply adding and appropriately inter-connecting additional computer systems into the existing sysplex and/or by increasing the processing capacity, either through replacement or upgrading, of one or more of the computer systems existent in the sysplex. As a result of its inherent fault tolerance and efficient expansion potential, sysplex architectures provide an extremely high level of overall reliability while also accommodating incremental growth in a highly cost-effective manner. Given this reliability, sysplexes are particularly attractive in handling so-called critical business support applications that involve real-time transaction processing, such as, e.g., in processing banking or stock market transactions, reservation requests or courier manifest information, which can tolerate essentially no downtime.
Furthermore, certain currently available computer systems that can be readily incorporated into a sysplex, such as illustratively the Enterprise Systems 9000 Series manufactured by the International Business Machines (IBM) Corporation, can each support, if appropriately configured, multiple simultaneously active instances of operating systems (O/S). Each such instance implements a separate corresponding individual application environment. Each of these environments utilizes a separate copy of the operating system, such as the MVS O/S (MVS and IBM are registered trademarks of the International Business Machines Corporation), so as to form a so-called O/S "image" along with a copy of corresponding application program(s) and a dedicated storage area (typically a logical partition--"LPAR"). Each of these computer systems employs at least one, though depending upon its architecture, possibly more, hardware processors as a server(s) to execute the various O/S images residing on that system. Regardless of the hardware constituency of each computer system, since each O/S image presents a unique application processing environment, that environment will be hereinafter referred to as a "system". For any application executing on multiple systems, a user is typically totally unaware of the particular system on which he or she is executing that application. Ideally, through suitable O/S software, a software failure in one system that halts application processing therein, should be isolated to that system and not affect the same application(s) being processed in any other system. Application processing would then be confined to the remaining systems, all of which are collectively sized to additionally accommodate the application processing and the users heretofore handled by the failed system. 
Thus, by using multiple O/S images in each CPC, a sysplex should be able to provide a further degree of fault tolerance and enhanced overall reliability, particularly with respect to software failures, than would a CPC that executes a single O/S image. Using separate O/S images and corresponding copies of application programs does require additional storage and processing overhead. However, the penalty exacted for doing so is usually quite small, particularly in view of the enhanced reliability resulting therefrom and the constantly declining cost of technology.
In practice, special needs arise if a sysplex is to process critical business support applications that can tolerate minimal, and often essentially no, downtime. First and foremost, any system failure must not cause all the other systems to interrupt their application processing while the failure is resolved. Any such interruption would simply halt the entire application, thereby producing an intolerable result. Furthermore, to protect integrity of the shared data, once any system fails, that system needs to be completely isolated (totally inhibited) from accessing the data. This isolation must continue until both the failure is completely resolved and the failed system is once again found to be fully and properly functional. If that system were not fully isolated and could, e.g., steal a lock resource and gain access to the data in some fashion, then that system, owing to its failure, could contaminate the shared data, for one or more applications, that would subsequently be accessed by any other such system. This, in turn, could well corrupt all further processing of these applications across the entire sysplex. In addition, a human operator should not be required to isolate the failed system and, if possible, resolve the failure itself. Currently, for reasons of economy and throughput speed, many computer installations run unattended. Requiring an operator to intervene, whether locally or remotely, would simply delay the onset of application processing, thereby lowering overall throughput. Furthermore, because human operators do make mistakes, they can unwittingly corrupt the shared data. In addition, operators may possess a low level of expertise, which may result in corruption of the shared data.
Correctly resolving system failures in a sysplex environment, particularly without adversely affecting the shared data, and in one that is performing critical business support application processing, and also deciding issues regarding data and system availability are complex and daunting tasks. Hence, these tasks should not be assigned to an operator.
In some sysplex installations, a separate service processor has been used to automatically isolate and reset a failed system. In operation, the service processor intercepts appropriate sysplex administrative screens and, through a suitable automation routine, generates commands to the sysplex, e.g. reset commands to the failed system. Disadvantageously, this approach requires the service processor, as well as its communications facilities to the sysplex, to have an extremely high availability--an availability that cannot always be guaranteed. In that regard, if the service processor or its communication lines were inactive for any reason, then this approach would be unable to isolate the failed system and protect the shared data.
Given these needs, one would think--at least ideally--that to minimize any adverse impact attributable to the loss of a system, the granularity of the servers and associated systems executing thereon should be made as small as possible. In this way, a workgroup, i.e. a portion of an entire workload, would be allocated to each and every system in the sysplex. Consequently, if a server or corresponding system were to fail, then only a minimal, and generally tolerable, loss of application throughput would be apparent to a customer.
To effectively employ such granularity, a technique has been developed by the present assignee that readily permits the failed system to be automatically and completely isolated from the shared data. This technique, commonly referred to as "fencing", can be invoked to isolate any failed system--regardless of whether the failure is in the O/S image or any application executing thereon. This technique is fully described in co-pending United States patent applications both by D. A. Elko et al entitled "Interdicting I/O and Messaging Operations in a Multi-System Complex" filed Mar. 30, 1992, and assigned Ser. No. 07/860,489 and entitled "Message Path Mechanism for Managing Connections Between Processors and a Coupling Facility", also filed Mar. 30, 1992 and assigned Ser. No. 07/860,646--collectively referred to herein as the Elko et al Fencing applications; both of which are also incorporated by reference herein. Through this technique, a hardware fencing facility is incorporated within each CPC in the sysplex. A common storage device, such as a DASD, that stores shared data and provides access thereto for each CPC, maintains a table, i.e. a so-called "couple dataset", of the current status of each CPC including the systems thereon. Periodically, each CPC interrogates the table to determine whether the status of each of the CPCs has been periodically updated and therethrough ascertain whether a corresponding system is operational or has failed. If an interrogating CPC detects a system failure, such as by detecting that a status update that should have periodically occurred, in fact, did not occur (a so-called "System Status Update Missing" condition), that CPC can generate a fence request to the fencing facility associated with the CPC that contains the non-operational system.
Essentially, in response to this request, the fencing facility blocks all subsequent input/output (I/O) requests that are specified by the fence request and that affect the shared data, the data itself residing on the DASD and/or a coupling facility.
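The status-polling scheme just described can be pictured with a minimal sketch. Everything named here (CoupleDataset, FencingFacility, poll_and_fence, the system names and the timing values) is hypothetical and chosen purely for illustration; in particular, a single fencing facility object stands in for the per-CPC hardware facilities, and the sketch does not represent the mechanism of the Elko et al Fencing applications.

```python
from dataclasses import dataclass, field


@dataclass
class CoupleDataset:
    """Toy stand-in for the shared status table (hypothetical structure)."""
    last_update: dict = field(default_factory=dict)  # system name -> time of last update

    def record_update(self, system, now):
        # Each system periodically refreshes its own status entry.
        self.last_update[system] = now

    def overdue(self, observer, now, interval):
        # Systems other than the observer whose periodic update did not occur.
        return sorted(
            name for name, stamp in self.last_update.items()
            if name != observer and now - stamp > interval
        )


@dataclass
class FencingFacility:
    """Toy fencing facility: remembers which systems are fenced."""
    fenced: set = field(default_factory=set)

    def fence(self, system):
        self.fenced.add(system)

    def permits_io(self, system):
        return system not in self.fenced


def poll_and_fence(dataset, facility, observer, now, interval):
    """Interrogate the table and fence every system with a missing update."""
    failed = dataset.overdue(observer, now, interval)
    for system in failed:
        facility.fence(system)
    return failed
```

For instance, if SYS1 and SYS2 both record an update at time 0 and only SYS1 records another at time 10, a poll by SYS1 at time 12 with an interval of 5 reports SYS2 as missing and fences it, after which permits_io("SYS2") is false.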
In practice, customers, to best meet their own individual business needs, determine what workgroups are allocated to each system. As a result of various considerations involving system management overhead, storage overhead and the complexity of using multiple O/S images, customers exhibit a marked tendency to aggregate widely differing workgroups on a single O/S image, i.e. on a single system. Furthermore, a sysplex may be serving a wide user community for any given application. As such, the workgroups themselves that are executed on that one system, owing to their particular application mix and the specific work then executing against them, may possess widely differing response time requirements--let alone workgroup differences that occur from one system to another. In this regard, some applications, that are not particularly time sensitive, can execute on an interactive or batch basis (depending upon whether human interaction is needed or not), while critical business support applications (which, as noted, are highly time sensitive) execute on a real-time basis. Therefore, if, as is often the case, a single O/S image were to execute separate workgroups with widely differing time requirements, isolation would also need to extend to a lower level, i.e. the individual workgroups (or application(s)) themselves, than just to a system level. In this instance, if an application itself failed, then, e.g., a workgroup containing this application should be isolated ("fenced") without a necessity to isolate the entire system itself that is executing that application. As a result, the system would advantageously continue to process its remaining non-isolated workgroups, thereby providing enhanced sysplex throughput in the presence of an application or temporary system failure. 
To ensure needed data integrity, the time sensitive nature of critical business support applications mandates that a workgroup of these applications (or even a particular application itself) be immediately isolated in the event of its failure. However, workgroups of less time sensitive applications could tolerate a delay (even one that is relatively long, on the order of minutes or even hours) in accessing shared data, such as that required for the failure to be resolved, before being isolated from their shared data. Unfortunately, thus far the art totally fails to teach how individual workgroup isolation can be accomplished.
In addition, apart from a failure occurring at an application level which requires sub-system (i.e. workload or application) fencing, hardware and other failures could occur in a sysplex that adversely affect a server or even an entire CPC. Inasmuch as such a failure, depending upon its nature, could also result in a corruption of the shared data, the server or entire CPC should, when necessary, be isolated from accessing the data.
Presently, the MVS O/S supports an I/O Prevention function which provides sub-system fencing, i.e. this function, when invoked, prevents a failed sub-system from invoking I/O operations. In particular, through this function, a sub-system can associate a so-called I/O Prevention identifier (IOPID) with an I/O operation. The IOPID contains an 8-bit index, into an I/O Prevention table (IOPT), and a 24-bit sequence number. The MVS O/S maintains the IOPT. Should a sub-system fail, then, to ensure data integrity, a functioning sub-system can request that the failed sub-system be prevented from undertaking any subsequent I/O operations. To make such a request, a functioning sub-system passes the IOPID of the failed sub-system to the MVS operating system which, in turn, determines whether the sequence number in the IOPID matches that in the corresponding indexed entry in the IOPT. If such a match occurs, the MVS O/S marks that IOPT entry as "not in use". Thereafter, whenever an I/O request containing that IOPID is passed to an I/O Supervisor in the MVS O/S, the Supervisor will fail that request if the corresponding IOPT entry is marked as "not in use" (i.e. the IOPID would be invalidated) or if the sequence numbers are unequal between that in the request and in the indexed IOPT entry. To completely process the request, the I/O Supervisor will also complete all active I/O operations with the failed IOPID, thereby purging the I/O devices of all such remaining requests. Consequently, once a valid I/O Prevention request against a failed sub-system has been fully processed, as set forth, then no I/O operations that specify the IOPID of the failed sub-system will be started. Employing a sequence number within the IOPID ensures that: (a) erroneous I/O Prevention requests are not honored, and (b) once a failed sub-system has had its I/O operations prevented, no further I/O operations with that IOPID will be started even if the same IOPT index value is reused.
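The validation logic described above can be illustrated by a minimal sketch. It is not MVS code: the class and method names are hypothetical, an IOPID is modeled simply as an (index, sequence number) pair rather than as a packed field, and the rule that a slot's sequence number is incremented each time the slot is assigned is an assumption made only so that the example runs.

```python
SEQ_MASK = (1 << 24) - 1  # 24-bit sequence number, as stated in the text


class IOPreventionTable:
    """Toy IOPT: one entry per index, holding an in-use flag and a sequence number."""

    def __init__(self, slots):
        self.entries = [{"in_use": False, "seq": 0} for _ in range(slots)]

    def assign(self, index):
        # Assumption: reusing a slot advances its sequence number.
        entry = self.entries[index]
        entry["seq"] = (entry["seq"] + 1) & SEQ_MASK
        entry["in_use"] = True
        return (index, entry["seq"])  # the IOPID handed to the sub-system

    def prevent(self, iopid):
        # Honor an I/O Prevention request only if the sequence numbers match.
        index, seq = iopid
        entry = self.entries[index]
        if entry["in_use"] and entry["seq"] == seq:
            entry["in_use"] = False  # mark the entry "not in use"
            return True
        return False  # erroneous or stale request is not honored

    def io_permitted(self, iopid):
        # The I/O Supervisor's check on each request carrying this IOPID.
        index, seq = iopid
        entry = self.entries[index]
        return entry["in_use"] and entry["seq"] == seq
```

In this sketch, once prevent() succeeds for an IOPID, io_permitted() rejects it; if the same index is later assigned to another sub-system, the new IOPID carries a different sequence number, so the old one stays rejected and a late prevention request naming the old IOPID leaves the new holder untouched.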
Unfortunately, the I/O Prevention function, as presently implemented, presents two serious limitations. First, this function does not support hardware fencing. In that regard, the I/O Prevention function is software based. If the MVS O/S temporarily halts, the I/O Prevention function will simply not function at all. Second, this function is susceptible to erroneous so-called "back level" information. Specifically, a current trend in MVS computing is to run a CPC for a long period of time, e.g. on the order of weeks, without restarting the CPC and undertaking an initial program load (IPL). Given a relatively large number of different workloads and sub-systems on a CPC, the CPC may support a large number of different fenceable sub-systems that run under a single MVS O/S image. Disadvantageously, the IOPID field is only four bytes (32 bits) long. As a result, this field has proven to be just too small to contain both a sufficiently large index value and a sequence number to support a large number of different fenceable sub-systems. In that regard, the sequence number, being three bytes, is simply too short to prevent it from being exhausted and/or wrapping over the life of an MVS system, thus providing insufficient uniqueness for each fenceable sub-system. In the event the MVS O/S were to invalidate an IOPID for a given sub-system and then, due to a wrap in the sequence number, re-assign that IOPID to another sub-system (i.e. generating back level information), the IOPID for the former sub-system (being the same IOPID) would also become valid once again. As a result, I/O operations then issued by the formerly fenced sub-system would once again be permitted--clearly an undesirable condition. Also, if an IOPID assigned to a new sub-system were to wrap to a value associated with a currently fenced sub-system, then the fence, and its prevention of I/O access, would erroneously extend to the new sub-system.
In practice, the length of the IOPID cannot be easily enlarged owing to the adverse impact on existing software structures.
As one can see, the art has thus far failed to teach a fencing technique that provides: (a) multi-level isolation, i.e. one which can function at varying levels of granularity including the application level, depending upon the type of sysplex component failure encountered, and (b) sufficient long term uniqueness for each one of a large number of fenceable entities.
Another conventional technique that provides sub-system fencing involves use of a "reserve log". Here, a protocol is established such that for any one particular sub-system to gain access to shared data, that sub-system must first write an entry into a log. In doing so, that sub-system first obtains a so-called hardware reserve. While this sub-system holds the reserve, this sub-system effectively locks out any other sub-system from writing to the log and accessing the data. Unfortunately, this approach typically requires an I/O access, i.e. to the log, to occur prior to accessing the shared data through a coupling facility. Since an I/O access is typically several orders of magnitude slower than a coupling facility access, use of this approach can significantly slow the processing throughput of the sysplex.
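The wrap hazard can be made concrete with a small sketch. The helper names are hypothetical, the increment-per-reassignment rule is an assumption, and the reassignment rate used in the usage note is purely illustrative rather than a measured figure.

```python
def next_seq(seq, bits):
    """Advance a sequence number that wraps at the given bit width."""
    return (seq + 1) & ((1 << bits) - 1)


def distinct_sequence_values(bits):
    """Number of reassignments of one slot before its sequence number repeats."""
    return 1 << bits


def stale_id_revalidated(bits):
    """Simulate one slot: does a fenced (index, seq) pair get issued again?"""
    fenced_id = (0, 1)  # the ID that was invalidated
    seq = 1
    for _ in range(distinct_sequence_values(bits)):
        seq = next_seq(seq, bits)  # each reassignment of slot 0
    return (0, seq) == fenced_id


def days_until_wrap(bits, reassignments_per_second):
    """Time for one slot to exhaust its sequence space at a hypothetical rate."""
    return distinct_sequence_values(bits) / reassignments_per_second / 86400.0
```

A three-byte sequence number offers 2**24 = 16,777,216 distinct values per slot. Running stale_id_revalidated with a small width such as 4 bits shows the fenced pair being issued again after a full cycle, which is the "back level" condition described above. At an assumed rate of 10 reassignments per second on one slot, days_until_wrap(24, 10) comes to between 19 and 20 days, i.e. on the order of the uptimes the text mentions.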
Furthermore, in certain instances, the fencing technique disclosed in the Elko et al Fencing applications can also be disadvantageously quite time-consuming. Specifically, a system executing a given workgroup, depending upon the applications being executed therein, can be serving many users, for example as many as several hundred (if not more). Moreover, several workgroups could be served by this particular system. Since each user executes a process, the system can be executing quite a large number of user processes. Now, if this system, i.e. the target system, is to be fenced, all these processes would need to be simultaneously fenced through a hardware fencing facility. To do so, a CPC that generates a fence request also provides, as part of the request, a token that identifies a user process that is being executed on that system. This request, in turn, is routed through the coupling facility (which can be a so-called "structured external storage" SES device) to the fencing facility on a target CPC on which the target system resides. To actually isolate the target system, the target CPC would scan through its internal tables that list tokens associated with each and every access operation then occurring which involves the shared data storage device (e.g. a DASD or SES device). The token for each operation specifies which user process is then using that device. Such a fence request is needed for each and every token belonging to a process in the workgroup to be fenced. From a hardware perspective, the CPC would need to separately scan each and every respective I/O and SES operation it has for each and every token specified in the fence requests. Once a match is found for any one token in such a request, the status of that corresponding operation would be changed to discontinue the shared data access then being fenced and to prevent any further I/O or SES requests from being communicated, for the process being fenced, to the DASD or SES device. 
Unfortunately, repetitively scanning all the I/O and SES operations to locate token matches can be very time-consuming. If several hundred tokens are involved, each such operation could consume 1 second or more. A delay of this sort in fully isolating a failed workload in processing critical business support applications may be excessively long in duration, may hence permit some data corruption to occur, and is thereby intolerable with these applications.
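The cost of token-by-token fencing can be illustrated with a minimal sketch that merely counts comparisons. The function name, the dictionary layout of active operations and the workload sizes in the usage note are hypothetical; the sketch models only the repeated-scan pattern described above, not actual CPC hardware behavior.

```python
def fence_by_tokens(active_ops, tokens):
    """Process one fence request per token, scanning every active operation each time.

    active_ops maps an operation id to the token of the user process that owns it.
    Returns the set of fenced operation ids and the number of comparisons made.
    """
    comparisons = 0
    fenced = set()
    for token in tokens:  # one fence request per token
        for op_id, op_token in active_ops.items():  # full scan for each request
            comparisons += 1
            if op_token == token:
                fenced.add(op_id)  # discontinue and block this operation
    return fenced, comparisons
```

With, say, 900 active operations spread over 300 user processes, fencing all 300 tokens takes 300 x 900 = 270,000 comparisons in this sketch, since the work grows with the product of the token count and the number of active operations.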
Therefore, a need currently exists in the art for a multi-level hierarchical fencing technique, specifically apparatus and an accompanying method, that can be used in a multi-system environment, such as illustratively a sysplex, and that not only provides sufficient uniqueness and granularity but also expeditiously isolates a failed sysplex component and thereby enhances the protection accorded to shared data.
In particular, in the event of a failure in a sysplex, this technique, based upon the nature of the failed sysplex component, should support complete shared data isolation at a variety of granular levels, particularly software fencing at workload or sub-system (e.g. workgroup or individual application) levels, and hardware fencing at a workload or system level. To support long term unattended CPC operation, this technique should also provide sufficient uniqueness for each one of a substantial number of fenceable entities. Furthermore, such a technique should not require operator intervention or utilize a separate service processor. Moreover, such a technique should dispense with any requirement to scan each and every active shared data operation multiple times.
We anticipate that, if such a multi-level fencing technique were to be incorporated into a sysplex, its use would advantageously increase the attractiveness of processing, inter alia, critical business support applications in a sysplex environment.