The present disclosure relates generally to operation timing, and, in particular, to timing operations of different durations.
Mainframe computer systems, such as IBM's zSeries computing systems, have evolved into extremely useful systems, in large part because of their adaptability to the changing needs of enterprises. A zSeries system typically includes a mainframe, including Subchannel Control Blocks (SCBs), a Channel Subsystem (CSS), an I/O configuration, central processors (CPs), and a main storage. The CSS performs various functions/operations, including a Start Subchannel (SSCH) operation used to initiate the movement of data to and from main storage. The CSS also performs a Halt Subchannel (HSCH) operation and a Clear Subchannel (CSCH) operation, used mainly to reset devices and SCBs when I/O activity initiated by an SSCH instruction needs to be terminated or the device needs a reset.
It has become important to time certain I/O operations within the CSS, including those related to SCB I/O operations. Two such I/O operations that are timed by the CSS are the HSCH operation and the CSCH operation. These operations are described from a functional viewpoint in the z/Architecture Principles of Operation (POP), IBM Corporation, September 2005, 5th Edition, SA22-7832-04, incorporated herein by reference.
Current timing methods have been useful for timing operations. However, with the introduction of systems with more sophisticated features, such as the Multiple Channel Subsystem (MCSS) feature on the z990 mainframe, the number of subchannels increases significantly for both the system as a whole and on a per channel basis. This increase in the number of subchannels and SCBs has made current timing mechanisms less useful.
To better understand the problems associated with the existing timing methods, consider how a CSCH operation is initiated and currently timed from a zSeries system's internal viewpoint. A conventional process for timing these instructions includes using existing SCBs in firmware-accessible memory, otherwise known as the hardware system area (HSA) 1002, and System Assist Processors (SAPs) within the CSS 1000. If a software program running on a CP 1001 initiates a CSCH instruction, the CP firmware executing the CSCH instruction first sets the CSCH instruction into an SCB 1003 and then proceeds to queue that SCB on the bottom of a Work Queue (WQ) 1004. In the example shown in FIG. 1, the SCB is queued on the bottom of WQ z. The CP firmware then signals the SAP z, which is the “owner” of WQ z, with a “Halt/Clear signal” via control circuitry 1010. This signal indicates that one or more SCBs with these functions to process have been queued on the WQ z owned by SAP z. SAP z then searches WQ z, starting from the top of the queue, looking for a HSCH or CSCH instruction to process. Other SCBs with other types of I/O operations to process, such as Start Subchannel (SSCH), are skipped over. Once an SCB with a HSCH or CSCH instruction is found, the SAP z removes that SCB from WQ z. In this example, the SCB 1003 with the CSCH is dequeued. If this is the first time the SCB 1003 has been dequeued from a WQ within the context of this CSCH, SAP z 1005 stores a Time Stamp (T/S) 1006 into the SCB. The time stamp may be derived from a current time stamp as shown in detail in FIG. 2. Details of steps included in a conventional timing process as described above and in the remainder of this Background portion are shown in FIG. 3.
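By way of illustration only, the queue search described above may be sketched as follows. The sketch models a WQ as a simple Python list ordered from top to bottom. The names `SCB`, `func_ts`, and `dequeue_halt_clear` are hypothetical and are not taken from any actual CSS firmware.

```python
from dataclasses import dataclass
from typing import List, Optional

# Functions the SAP looks for after a Halt/Clear signal; SSCH is skipped.
HALT_CLEAR = {"HSCH", "CSCH"}


@dataclass
class SCB:
    ident: int
    function: str                    # "SSCH", "HSCH", or "CSCH"
    func_ts: Optional[float] = None  # T/S; unset until the first dequeue


def dequeue_halt_clear(wq: List[SCB], now: float) -> Optional[SCB]:
    """Search wq from the top; remove and return the first HSCH/CSCH SCB."""
    for i, scb in enumerate(wq):
        if scb.function in HALT_CLEAR:
            del wq[i]
            if scb.func_ts is None:  # first dequeue for this function
                scb.func_ts = now    # timing starts here
            return scb
    return None                      # nothing but skipped entries


# Two SSCH entries sit above the CSCH entry and are skipped over.
wq = [SCB(1, "SSCH"), SCB(2, "SSCH"), SCB(3, "CSCH")]
picked = dequeue_halt_clear(wq, now=100.0)

# A later re-queue and dequeue must not overwrite the original T/S.
wq.append(picked)
again = dequeue_halt_clear(wq, now=105.0)
```

In this sketch the time stamp is written only when it is still unset, which mirrors the statement that the T/S is stored the first time the SCB is dequeued within the context of the function.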
Referring back to FIG. 1, SAP z then begins processing the CSCH instruction. If a Clear Signal needs to be sent to a device in the I/O configuration 1007 that is associated with the SCB being processed for the CSCH, SAP z performs an I/O path selection to determine the channel path, identified by a channel path identifier (CHPID) number within the CSS 1000, that would drive the Clear Signal to the desired device via a channel connection 1009. If, for example, either CHPID 02 or CHPID y was selected, SAP z could proceed to signal the selected CHPID to perform the CSCH. This is because those CHPIDs have affinity to SAP z as indicated by the hash lines 1008. For further description of “Channel to SAP (or IOP)” affinity, the reader is referred to U.S. Pat. No. 6,973,529. If CHPID 01 needs to be selected, the SCB with the CSCH function would need to be queued on the bottom of WQ 0 since CHPID 01 has affinity to SAP 0 in this example. Then SAP z would have to signal SAP 0 with a “Halt/Clear signal”. Once this SCB bubbled to the top of WQ 0, SAP 0 would then perform the same steps as were previously performed by SAP z, with the exception of inserting a T/S in the subchannel, since the timing would already have been started. Hence, this re-queue on the WQ further elongates the elapsed time (ET) it takes to complete the CSCH operation.
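The affinity-based hand-off described above may be sketched, purely as an illustration, as follows. The affinity table reflects only the FIG. 1 example (CHPID 01 to SAP 0; CHPID 02 and CHPID y to SAP z). The function name `route_function`, the return strings, and the list-based queues are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical affinity table modeled on the FIG. 1 example only.
AFFINITY: Dict[str, str] = {"01": "SAP 0", "02": "SAP z", "y": "SAP z"}


def route_function(current_sap: str,
                   chpid: str,
                   scb_id: int,
                   work_queues: Dict[str, List[int]],
                   signals: List[Tuple[str, str]]) -> str:
    """Signal the CHPID directly, or hand the SCB to the owning SAP."""
    owner = AFFINITY[chpid]
    if owner == current_sap:
        return "signal-chpid"             # CHPID has affinity to this SAP
    work_queues[owner].append(scb_id)     # queue on bottom of owner's WQ
    signals.append((owner, "Halt/Clear"))  # tell the owner there is work
    return "requeued"


work_queues: Dict[str, List[int]] = {"SAP 0": [], "SAP z": []}
signals: List[Tuple[str, str]] = []

first = route_function("SAP z", "02", 1003, work_queues, signals)
second = route_function("SAP z", "01", 1003, work_queues, signals)
```

The second call shows the extra queueing step that adds to the elapsed time: the SCB must wait on WQ 0 until SAP 0 reaches it.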
If the SCB is on WQ 0, and SAP 0 dequeues it, SAP 0 instead computes the ET for this CSCH and compares it to the elapsed time limit (ETL) for a CSCH rather than storing a new function T/S in the SCB. Once the path is selected, the CHPID is signaled by SAP 0 via circuitry within the CSS to perform the CSCH operation. If the CHPID is busy and cannot accept the signal to perform the CSCH operation, SAP 0 puts the SCB back on WQ 0, and the process described in this paragraph is repeated. If this busy condition does not clear up after repeated attempts, the ET eventually exceeds the ETL, and recovery is invoked as shown in detail in FIG. 3.
If CHPID 01 is able to process the CSCH operation, SAP 0 still puts the SCB back on WQ 0 after signaling CHPID 01 to perform the CSCH, but with the Clear Issued State set within the SCB. At this point, the SCB is on the WQ merely to time the completion of the CSCH. When CHPID 01 completes the CSCH, which may involve sending a Clear Signal to the desired device, CHPID 01 signals SAP 0 that it has completed the CSCH. SAP 0 then dequeues the SCB from WQ 0 and reports the results of the CSCH operation back to a CP that, in turn, signals the software program that issued it. If some problem occurs that prevents the CHPID from signaling SAP 0 of completion of the CSCH operation, the ET eventually exceeds the ETL, and recovery is invoked for that CHPID as indicated in FIG. 3. Recovery results in a reset of that CHPID, which not only causes the CSCH to complete, but also resets other operations on that channel, possibly resulting in one or more channel control check (CCC) operations.
CCC processing is a form of SCB recovery that is performed as the result of an error within the CSS that is encountered while the CSS is working with a particular subchannel. A high-level description of CCC processing is also provided in the above-referenced POP document. As in the case with the CSCH operation, SCBs with the CCC operation to perform are queued on a WQ. Also, as part of CCC processing, the channel is usually given the initiative to issue a Clear Signal to the device associated with the SCB. One difference between CSCH and CCC processing is that instead of an OS running on a CP initiating the CCC instruction, as is done for a CSCH instruction, the CSS initiates the CCC processing. Nevertheless, the WQs are used to keep initiative to perform the CCC operation as well as to time the completion of the operation. Also, requeuing the SCB on the WQ can occur for the same reasons as for the HSCH and CSCH operations (e.g., affinity, busy paths, and timing).
Another difference between CCC processing and CSCH processing is the ETL that is chosen. The recovery from the kinds of errors resulting in CCC usually involves resetting the entire channel that was working with the SCB at the time of the error prior to the actual processing of the CCC. Since the time to process a CCC on a reset channel is typically less than the time it takes to process a CSCH on a loaded channel, the ETL chosen for CCC processing to complete is typically set lower than the ETL for CSCH processing. So, for example, the ETL for CSCH processing is typically set at 14 seconds (as is the ETL for HSCH processing), while the ETL for CCC processing is typically set for 7 seconds. Thus, the code that does the timing needs to have logic to differentiate various ETLs depending on function.
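The function-dependent limits described above may be illustrated by the following minimal sketch. The 14 second and 7 second values are the typical values given in the text. The names `ETL_SECONDS`, `elapsed_time`, and `timed_out` are hypothetical, and treating a function without a table entry as untimed is an assumption of this sketch.

```python
from typing import Dict, Optional

# Typical elapsed time limits (ETLs) in seconds, keyed by function.
ETL_SECONDS: Dict[str, float] = {"HSCH": 14.0, "CSCH": 14.0, "CCC": 7.0}


def elapsed_time(current_ts: float, func_ts: float) -> float:
    """ET is the current time stamp minus the function time stamp."""
    return current_ts - func_ts


def timed_out(function: str, current_ts: float,
              func_ts: Optional[float]) -> bool:
    """Return True when the ET exceeds the ETL for this function."""
    limit = ETL_SECONDS.get(function)
    if limit is None or func_ts is None:
        # Assumption: no ETL entry, or no T/S stored yet, means no timeout.
        return False
    return elapsed_time(current_ts, func_ts) > limit


# The same 8 second ET is a timeout for CCC but not for HSCH.
ccc_expired = timed_out("CCC", current_ts=108.0, func_ts=100.0)
hsch_expired = timed_out("HSCH", current_ts=108.0, func_ts=100.0)
```

A single lookup table keeps the differentiation between ETLs in one place, rather than spreading per-function comparisons through the timing code.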
FIG. 4 illustrates states of work queues and timing of subchannel control blocks in a conventional timing process. In FIG. 4, a WQ is shown as a doubly linked list (DLL), in which the SCBs are linked by Top/Bottom pointers in the WQ header 4010 and by Next/Previous pointers in the SCBs, as indicated by the arrows at 4020. Each SCB is shown to have either a SSCH, HSCH, CSCH or CCC operation to perform. Note that each SCB ET that is shown depicts the ET that would be computed using the algorithm in FIG. 3 which, in turn, uses the illustrated Current T/S 4000 and an arbitrary FUNC T/S in each SCB.
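The doubly linked organization of the WQ may be sketched as follows, for illustration only. The class and method names (`SCB`, `WorkQueue`, `enqueue_bottom`, `remove`, `idents`) are hypothetical, and the six entries simply reuse the functions assigned to SCB 1 through SCB 6 in the FIG. 4 example.

```python
from typing import List, Optional


class SCB:
    """A subchannel control block carrying Next/Previous pointers."""

    def __init__(self, ident: int, function: str) -> None:
        self.ident = ident
        self.function = function
        self.next: Optional["SCB"] = None
        self.prev: Optional["SCB"] = None


class WorkQueue:
    """A WQ header holding Top/Bottom pointers into the list of SCBs."""

    def __init__(self) -> None:
        self.top: Optional[SCB] = None
        self.bottom: Optional[SCB] = None

    def enqueue_bottom(self, scb: SCB) -> None:
        scb.prev, scb.next = self.bottom, None
        if self.bottom is None:
            self.top = scb               # queue was empty
        else:
            self.bottom.next = scb
        self.bottom = scb

    def remove(self, scb: SCB) -> None:
        """Unlink scb from any position without walking the list."""
        if scb.prev is None:
            self.top = scb.next
        else:
            scb.prev.next = scb.next
        if scb.next is None:
            self.bottom = scb.prev
        else:
            scb.next.prev = scb.prev
        scb.prev = scb.next = None

    def idents(self) -> List[int]:
        out, node = [], self.top
        while node is not None:
            out.append(node.ident)
            node = node.next
        return out


wq = WorkQueue()
functions = ["HSCH", "HSCH", "CSCH", "CSCH", "CCC", "SSCH"]
scbs = [SCB(i + 1, f) for i, f in enumerate(functions)]
for scb in scbs:
    wq.enqueue_bottom(scb)

wq.remove(scbs[2])           # take SCB 3 out of the middle
wq.enqueue_bottom(scbs[2])   # and re-queue it on the bottom
```

The Previous pointer is what allows an SCB found in the middle of the queue to be removed without a second traversal.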
In this example, SCB 1 4001 is in the HSCH Not Issued state, which means the SAP has not yet signaled a CHPID to perform the HSCH. SCB 1 was put on WQ 0 for the first time for this HSCH, and it is shown to have no FUNC T/S value set at 4001. The FUNC T/S will be set into the SCB when the SAP dequeues the SCB off the WQ the first time for this HSCH.
SCB 2 4002 is in the HSCH Issued state, which means the HSCH was issued to a channel. The SAP re-queued the SCB back on the WQ after it issued the HSCH to a channel only as a means to give initiative to the SAP to time the HSCH operation. Completion occurs when the channel responds back to the SAP that the HSCH operation is completed, at which time the SAP will remove the SCB from the WQ. In the meantime, each time this SCB “bubbles” up to the top of the queue, the SAP dequeues it and computes the ET to determine if there is a timeout as illustrated in FIG. 3. If there is no timeout, the SCB is re-queued on the bottom of this WQ. This SCB would likely have been dequeued to check for a time out and re-queued multiple times, resulting in an ET of approximately 4-7 seconds.
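The repeated dequeue, check, and re-queue cycle described above may be approximated by the following simplified simulation. It assumes, purely for illustration, that the SCB reaches the top of the WQ at a fixed interval (`cycle_seconds`) and that completion is noticed at the next such visit. Neither assumption comes from the conventional process itself, and the name `time_issued_scb` is hypothetical.

```python
from typing import Tuple


def time_issued_scb(func_ts: float,
                    etl: float,
                    completion_ts: float,
                    cycle_seconds: float) -> Tuple[str, int]:
    """Count re-queues until the function completes or times out."""
    now = func_ts
    requeues = 0
    while True:
        if now >= completion_ts:
            return ("completed", requeues)  # channel has responded
        if now - func_ts > etl:
            return ("timeout", requeues)    # ET exceeds ETL: recovery
        requeues += 1                       # back on the bottom of the WQ
        now += cycle_seconds                # wait to bubble up again


# HSCH that completes 5 seconds after the T/S, visited once per second.
done = time_issued_scb(func_ts=0.0, etl=14.0,
                       completion_ts=5.0, cycle_seconds=1.0)

# HSCH whose channel never responds.
stuck = time_issued_scb(func_ts=0.0, etl=14.0,
                        completion_ts=float("inf"), cycle_seconds=1.0)
```

Even in this simplified model, a single timed operation occupies a WQ slot many times over, which is the per-operation overhead that the remainder of this Background portion is concerned with.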
SCB 3 4003 is in the CSCH Not Issued state. With a FUNC T/S set, the CSCH operation likely had been attempted to be issued to a channel. In this case, WQ 0 is being used to provide initiative for the SAP to keep attempting to issue the CSCH operation to a channel. Most likely, the selected channel was busy and the SCB had to be re-queued on the WQ to try later. The ET of 0.5 seconds at 4003 reflects that the SCB 3 had not been re-queued as many times as the SCB 2.
SCB 4 4004 is in a CSCH Issued state, which means the CSCH was issued to a channel. Like the HSCH Issued state for SCB 2, the SCB would be on the WQ as a means to time for completion of the CSCH operation. However, in this case, the ET of 14.2 seconds at 4005 has exceeded the ETL timeout value. It is likely that this SCB was re-queued on the WQ many more times than SCB 2 and SCB 3 were re-queued. Since the ET is over the 14 second ETL for this operation, when this SCB is dequeued, the SAP would then take the appropriate actions to recover both SCB 4 and the channel, which in turn would cause the CSCH operation to complete.
SCB 5 is in a CCC Issued state. Like the HSCH Issued and CSCH Issued states, the SCB 5 would be on the WQ as a means to time the CCC. However, for CCC functions, the ETL timeout is different than for the HSCH or CSCH functions, i.e., 7 seconds versus 14 seconds.
SCB 6 is in a SSCH Not Issued state. SSCH instructions are not timed by the CSS. Thus, unlike HSCH, CSCH and CCC functions, once the SSCH is issued to the channel, the SCB is not put back on the WQ to be timed. Hence, there is less overhead per operation in terms of WQ utilization with regard to SSCHs once the SSCH is issued.
The speed at which a SSCH completes is an important benchmark measurement of a zSeries mainframe. Thus, it is important that any potential performance bottlenecks that slow down SSCH processing are minimized. Most of the time, the majority of the SCBs on the WQs have SSCH functions to process. However, at times, HSCH and CSCH instructions are used by the OS in recovery situations, and CCC processing is initiated by the CSS during CSS recovery operations.
An example of a situation in which the OS issues an unusually large number of HSCH instructions (and, possibly, CSCH instructions) is when a link failure outboard in the fabric is detected by a FICON channel and reported to the OS. When a link failure occurs, the OS performs device recovery for every device associated with that link. This may involve a large number of devices and a large number of subchannel control blocks, each of which would need to be issued a HSCH instruction and possibly a CSCH instruction. Even with a flurry of HSCH instructions to process, in the past there was no noticeable effect on SSCH performance on early mainframes.
However, with the introduction of more sophisticated mainframes with a significant increase in the number of subchannels and SCBs, the WQ performance may be impacted if it is used as a means to keep initiative for timing and processing functions, such as HSCH, CSCH and CCC operations. Adding to the WQ congestion is the need to re-queue the same SCB on the WQ multiple times for timing and/or to maintain initiative for the functions as discussed above. Also, the method of giving high priority to HSCH and CSCH by having the SAP search for SCBs on WQs with HSCH or CSCH functions pending further delays the processing of SSCHs.
Accordingly, there is a need for a new operation timing technique that reduces WQ bottlenecks.