1.1 Introduction
The invention relates to a method of dynamically optimizing network control parameters in a Systems Network Architecture (SNA) network. For purposes of illustration, one implementation of the invention is described in connection with the well-known IBM Virtual Telecommunications Access Method (VTAM) software running on IBM or plug-compatible mainframe computers. It will be appreciated by those of ordinary skill having the benefit of this disclosure that the invention can be similarly implemented in other SNA-compliant networks, e.g., those involving an IBM AS/400 or similar computer.
The details of SNA are extensively documented in a variety of widely available publications and other references. The IBM publication "SNA Technical Overview," publication no. GC30-3073-3, hereafter "[SNATechOv]," is incorporated by reference as nonessential background information familiar to those of ordinary skill. Chapters 1 and 2 and the glossary of [SNATechOv] may be especially helpful.
The microfiche appendices, which together comprise 5 sheets of microfiche having 418 frames, submitted as part of this specification include a) Appendix 1, selected source code extracts from a commercial software package distributed by the assignee of this application under the trademark "OPERTUNE," as well as b) Appendix 2, a reference manual setting out detailed technical information for network administrators and distributed as part of the software package. Permission is granted to make copies of the microfiche appendices solely in connection with the making of facsimile copies of this application in accordance with applicable law; all other rights are reserved, and all other reproduction, distribution, creation of derivative works based on the contents, public display, and public performance of the microfiche appendices or any part thereof are prohibited by the copyright laws.
1.2 Overview of SNA Architecture & Glossary of Terms
The concepts discussed in the overview explanation that follows are illustrated in FIG. 1, which is a block diagram showing the hierarchical nature of the SNA architecture, and in FIG. 2, which is a functional block diagram showing a typical message path from an end-user at a terminal LU to a host computer P5 and back.
SNA was developed as a hierarchical architecture organized into groups that have specific functions. SNA "nodes" are collected into a structure of "networks," "domains," and "subareas" as explained in the following glossary of selected terms well known to those of ordinary skill:
37xx: the product number of a series of communication controllers introduced by IBM between 1975 and the present. These controllers are responsible for relieving the central processing unit of much of the burden of communications management. The 37xx series executes a software program called Network Control Program (NCP) that controls and directs communication controller activity.
APPN: Advanced Peer-to-Peer Networking, a newer form of SNA communication whereby Physical Unit Type 2.1 nodes can initiate sessions with one another without going through VTAM.
Bottleneck: a network problem that occurs when messages are entering one or more network components faster than they can be forwarded to their destinations.
Boundary link: a link comprising part of a path between two SNA nodes and physically terminating in or attached to one of the nodes. See also Intermediate link.
Channel: an SNA channel (sometimes referred to as a "370 data channel") is a communications path, largely local to a host computer and its on-site peripherals, that makes use of a specific SNA communications protocol. See generally [SNATechOv] FIG. 1-3. Channels are sometimes referred to colloquially as "channel attachments" attached to a host computer. The protocol used in channel attachments is characterized by comparatively high data throughput, e.g., 3 million bytes per second (Mbps) and higher.
Controller: a communications controller (sometimes referred to as a "cluster controller") provides an interface between an SNA network and one or more end users at terminals. It buffers the entries that users make at their terminals. When polled by the NCP, the cluster controller delivers the buffers to the NCP. When the NCP selects and delivers messages to the cluster controller, the cluster controller receives the buffers and delivers each message to the correct terminal.
CUA: Common User Access, a series of specifications for the interface between the user and applications executing on IBM and compatible mainframes. CUA specifies how information is presented, and how the user selects application options.
Domain: all subareas that are controlled by a common VTAM (P5) node.
FEP: Front End Processor, a name given to the 37xx series and compatible communication controllers.
Intermediate link: a link comprising an intermediate part of a path between two SNA nodes but not physically terminating in or attached to either node. See also Boundary link.
JCL: Job Control Language.
Load module: a module of executable program code formatted to be loaded into a processor memory for execution.
Link: a communications path between two nodes in an SNA network, normally operating in conformance with the Synchronous Data Link Control (SDLC) communications protocol.
LU: logical unit.
Modem delay: the time required for the modem circuitry to modulate and demodulate digital information within the sending and receiving modems. A typical modem delay is from 15 to 50 milliseconds per modem pair per transmission.
MVS: IBM's Multiple Virtual Storage operating system.
NCP: see Network Control Program.
Network: all connected domains.
Network architecture: the rules that govern the services, functions, and protocols of network components. A widely used network architecture is the Systems Network Architecture (SNA) developed by IBM.
Network Control Program (NCP): an IBM computer program that executes in the controller hardware to perform the work of network communication for remote connections. The NCP polls the cluster controllers to send and receive messages and controls dialing and answering modems that are attached to it. The NCP routes messages that are destined for other subareas.
Network resource: the speed or capacity of a physical network component that is needed by network users to move data from one point to another in a network.
Network user: an end user or application that requires network resources to complete assigned tasks.
Node: a set of hardware, and the software associated with that hardware, that implements all seven standard layers of the SNA architecture (which are physical control, data link control, path control, transmission control, data flow control, presentation services, and transaction services).
Overutilization: a network problem that occurs when the number of network users exceeds the capacity of a network resource. Network users must either accept less of the resource than they requested, or wait longer to acquire the requested amount. See also Underutilization.
Propagation delay: the amount of time required for electrical signals or electromagnetic waves to move from one end of a link to another. The propagation delay for a 300 foot (91.5 meter) cable is about 4 microseconds. The propagation delay from a ground station to a satellite in geosynchronous orbit is about 150 milliseconds.
P2: a designation for a cluster controller.
P4: a designation for an NCP node.
P5: a designation for a VTAM node.
Path information unit: a unit of message traffic.
PIU: path information unit.
PU: physical unit.
Queuing time: the time spent waiting for access to a network resource. Queuing time depends on the level of network activity, which is expressed as the percentage of line utilization, and is typically expressed as a multiple of the transmission time. It can be one of the largest components of response time and is one of the most likely causes of a response-time problem symptom.
Response time: the time required for an entry from a network end point (such as a user terminal) to travel the network to a host, complete processing within the host, and travel back to the network end point. From a network user's perspective, response time is the interval between pressing the Enter key at a terminal or station and receiving a ready-for-additional-commands prompt in reply. In most cases, the travel time between the host and the end point is the largest component of response time.
SDLC: Synchronous Data Link Control.
Session: a connection between two logical units (e.g., two applications or an application and an end user) that establishes the rules and a path for communication between the two. Except for Advanced Peer-to-Peer Networking (APPN), all sessions are initiated through a host processor executing VTAM. Two logical units that are connected in this way are often referred to as being "in session."
SNA: Systems Network Architecture.
Subarea: a VTAM or NCP node (P5 or P4) and all the cluster controllers (P2s) or token rings that are attached and controlled by it.
TG: Transmission Group, an SNA definition that allows one or more SDLC links between adjacent communications controllers to be used as a single logical link. (A single System/370 channel can also be a transmission group.)
Think time: the time required for an end user to respond to a prompt from a terminal with an action. Studies have shown that think time varies with the terminal response time, and that as response time decreases to less than a second, and again to under half a second, think time decreases at an even faster rate.
Throughput: the amount of data that can be sent through the network in a given period of time. Throughput is sometimes confused with response time, which indicates how fast a single operation occurs. A network that can transfer 2 Megabytes of information in a second has twice the throughput of a network that can transfer 1 Megabyte per second. (The response time for each transfer, 1 second, is the same.)
Transmission time: the time required to move a message from the sending component to the receiving component within a network. Transmission time is determined by the baud rate or bits per second rate of the line, the time required for the link protocol, the time required for the routing process headers and trailers, the character code length, and the message length. Transmission time can typically vary from milliseconds to seconds.
Turn-around time: the time required for a network component to change from one mode of transmission (sending or receiving) to another. Turn-around time is unique to half-duplex circuits and/or operations.
Underutilization: a network problem that occurs when much of the capacity of a network resource is not needed by network users and is being wasted. (See also Overutilization.)
DOS/VSE: Disk Operating System/Virtual Storage Extended. A mainframe operating system developed by IBM that is an extension of an earlier operating system, Disk Operating System/Virtual Storage (DOS/VS).
VR: Virtual Route, an SNA definition that allows logical routes based on transmission priorities to be mapped to the real connections (explicit routes or ERs) that exist between two subareas.
VTAM: Virtual Telecommunications Access Method software. VTAM executing in a host processor system controls the interface between host applications and the network. It also maintains the domain configuration and initiates communications called "sessions" between the other network components. (A newer type of SNA component, the PU Type 2.1 node, can initiate sessions with another PU Type 2.1 node without VTAM intervention in a process called Advanced Peer-to-Peer Networking or "APPN.")
XMT: transmission time.
1.3 Initialization of an SNA Network
A key aspect of the background of the invention is the manner in which initialization of an SNA network is normally accomplished. As is well known to those of ordinary skill, during initialization of a network a customized NCP program "load module" (executable program) is created for each communications controller or FEP by a network administrator who runs one or more utility programs to link selected program components together. The network administrator's customization of each load module includes selecting appropriate values for various network tuning parameters that are discussed in more detail below.
Load modules are selectively downloaded from a host computer running VTAM to one or more selected controllers on the network over a channel or a link. The controller stores the load module into local storage and formats its remaining storage for use as buffers for incoming and outgoing traffic. After the controller buffers are initialized, VTAM sends an "activation attempt" request message to the controller, which takes actions required to activate devices in its domain or subarea.
Importantly in the context of the invention, a controller being initialized is not operational to service network user requests during the downloading process (and also during the linking process if the controller has not been previously initialized). Consequently, any terminals or other devices that communicate with the host computer via the controller are likewise unavailable for use on the network during controller initialization.
1.4 Overview of Selected SNA Network Operation Aspects
During normal network operations, a terminal device may send a message requesting that a "session" be established with an application program executing on the host computer system. The request for a session is relayed from the terminal device via one or more controllers (e.g., across zero or more intermediate links and a boundary link) to VTAM executing on the host computer. VTAM negotiates with the application program to establish the session and returns a session-establishment message to the terminal. The session then has a "virtual route"--a predefined route--over which message traffic can flow between the terminal device and the application program and vice versa.
If any link in a session's virtual route is lost, the entire session abnormally terminates. Some link redundancy may be available, however: a virtual route is assigned to a transmission group (TG) for each leg of its path, and a transmission group may be single- or multi-link. If one link in a multi-link transmission group fails, another link takes over without disruption of the virtual route. Session establishment is insensitive to the actual path.
A problem can arise from this approach if the links in a multi-link group have different speeds or throughput capabilities. If a slow link is activated first (or if a fast link goes down, is replaced by a slower link, but later is brought back up), the faster link is selected for use only when the slower one is busy, even though the faster link is idle and available, because the NCP keeps no record, in this context, of which line is fastest.
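The link-selection behavior just described can be illustrated with a minimal sketch. It is not taken from the NCP; the Link class, the select_link function, and the line speeds are hypothetical names and values chosen only to show how a selection rule that scans links in activation order, without regard to speed, leaves a faster link idle.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    bps: int            # nominal line speed (illustrative value only)
    busy: bool = False


def select_link(links_in_activation_order):
    """Return the first non-busy link, scanning in activation order.

    Speed is deliberately ignored, mirroring the behavior described above.
    """
    for link in links_in_activation_order:
        if not link.busy:
            return link
    return None  # every link in the group is busy


slow = Link("slow", 9_600)
fast = Link("fast", 56_000)
tg = [slow, fast]             # the slow link happened to be activated first

print(select_link(tg).name)   # slow: chosen although the fast link is idle
slow.busy = True
print(select_link(tg).name)   # fast: used only because the slow link is busy
```

A speed-aware rule would instead pick the fastest idle link, which requires the very knowledge of line speeds that the NCP is described as not keeping.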
Abnormal or emergency operations of an SNA network notably include a buffer shortage on a communications controller. Controllers maintain several buffer pools, any of which can suffer a shortage. If a shortage occurs in an intermediate-link controller, the controller's response to the shortage typically is to slow down incoming traffic by setting a flag-type bit in a control block. If a shortage occurs in a boundary-link controller, the controller typically sends a RECEIVE NOT READY message, which also slows down incoming traffic, and additionally tries to speed up outgoing traffic.
Shutdown of a communications controller can occur, e.g., for routine or emergency maintenance, for replacement or augmentation of physical components, or for reconfiguration of the network, domain, or subarea. As part of the shutdown process, the controller's NCP sends a shutdown notification message to its domain or subarea devices and if possible to the host. At that point all virtual routes including that controller are lost. Each VTAM that "owns" such a virtual route is assumed to be responsible for knowing what virtual routes are dependent on that controller and for notifying application programs (or other VTAMs that are making use of cross-domain messages) that were using the controller as part of their virtual routes that the route is lost.
1.5 Response Time and SNA Network Performance Limitations
As SNA networks grow, network performance is affected by imbalances between network resources and the needs of the network users. Such imbalances can create response time problems. When network user needs exceed network resource capabilities, network users must either accept less network service than they need or must wait longer to receive it. Network problems of any size can have a tremendous impact on the network's ability to move messages freely from source to destination.
The symptoms of an SNA network problem may be either external or internal. External symptoms can be observed by anyone who uses the network. The most obvious external symptom is a longer response time. Internal symptoms can only be observed using network performance tools. These symptoms may be labeled as bottlenecks, over- and underutilizations, and throughput problems.
It can be difficult to find the cause of a network problem because the symptoms are often inconsistent. Symptoms can appear gradually, suddenly, individually, or in combination, move from component to component, or appear and vanish for no apparent reason.
A network problem will frequently exhibit a response time or availability problem. Consequently, response time is a frequent starting point for identifying the cause of a network problem. Referring to FIG. 2, response time is an accumulation of time intervals contributed by each network component through which a message passes. Generally speaking, response time can be summarized as the aggregate of the transit time from the user to the processor (including modem delays, propagation delays, and queuing times), the turnaround time within the processor (normally insignificant in response-time calculations), and the transit time from the processor back to the user. Some of these time intervals are essentially constant, such as the line speed between two components. Other intervals are variable, such as queuing delay, the amount of time a message must wait in line behind other messages before departure for the next component along the route.
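The accumulation described above can be written out as simple arithmetic. The following sketch is illustrative only: the function name, the component names, and every millisecond value are assumptions made for the example rather than measurements, and transmission time is included as one of the per-leg components in keeping with the glossary.

```python
def response_time_ms(inbound, host_ms, outbound):
    """Sum the per-component delays of one round trip, in milliseconds."""
    return sum(inbound.values()) + host_ms + sum(outbound.values())


# Hypothetical delays for the leg from the user to the processor.
inbound = {"modem": 30.0, "propagation": 1.0, "queuing": 120.0, "transmission": 200.0}
# Hypothetical delays for the leg from the processor back to the user.
outbound = {"modem": 30.0, "propagation": 1.0, "queuing": 80.0, "transmission": 400.0}

print(response_time_ms(inbound, host_ms=5.0, outbound=outbound))  # 867.0
```

In this example the modem and propagation terms are fixed by the equipment, whereas the queuing terms change with network activity and are therefore the ones that tuning can influence.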
Queuing time is a primary component of response time. It is related to the level of network activity and thus to the balance between network users and network resources. When line utilization is low, most of the line capacity is available for transmitting messages so queuing time is also low. (When line utilization is too low, valuable line resources are being wasted.) When line utilization is high, little capacity is available for additional transmissions. The transmitting process is likely to find the line busy when it attempts the transmission. Messages accumulate in a queue and queuing time increases sharply. High line utilization, together with the correspondingly long queuing time, thus indicates an imbalance between network users and network resources. When an imbalance occurs, line utilization and queuing time increase, and end users notice longer response times.
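The sharp rise in queuing time at high utilization can be illustrated with a textbook single-server (M/M/1) queuing model. That model is an assumption introduced here for illustration and is not taken from this disclosure; under it, the mean wait in the queue equals the transmission time multiplied by u/(1 - u), where u is the line utilization expressed as a fraction. The function name queuing_multiple is likewise hypothetical.

```python
def queuing_multiple(utilization):
    """Mean queuing time as a multiple of transmission time (M/M/1 assumption)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return utilization / (1.0 - utilization)


for pct in (10, 30, 50, 70, 90):
    multiple = queuing_multiple(pct / 100)
    print(f"{pct:3d}% utilization -> queuing time = {multiple:.2f} x transmission time")
```

Under this model a line at 50% utilization adds a queuing delay about equal to one transmission time, while a line at 90% utilization adds roughly nine transmission times, which is consistent with the qualitative behavior described above.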
1.6 Potential Solutions to Network Response-Time Problems
Three possible solutions for network problems are hardware upgrades, load reduction or balancing, and tuning of the network. Each solution has positive and negative aspects.
Hardware upgrades increase capacity and/or speed. Increased capacity reduces line utilization by providing more paths, whereas increased speed reduces transmission time and hence queuing time by providing a faster path.
Load reduction lowers line utilization by reducing line traffic, e.g., through the use of data compression techniques, but does not necessarily address the underlying causes of internal network problems which thus can reappear when traffic increases to its former level. Load balancing lowers line utilization by changing the relationships between network users and network resources. Load balancing can be accomplished by internal balancing, which changes the network configuration to distribute the network users more evenly across the network resources, or by external balancing, which changes the network usage patterns to distribute the network users more equitably over time.
Tuning is a method of improving network performance by adjusting the parameters that influence network characteristics, as discussed in more detail in the following subsections.
1.7 NCP Tuning Parameters
A number of NCP parameters may be "tuned" to optimize SNA network performance. A detailed description of numerous selected tuning parameters is set out in the reference manual reproduced in microfiche Appendix 2, especially in Appendix B thereof.
Tuning is potentially the most economical solution for network problems because it can obtain optimum performance from existing network resources before making costly upgrades or disruptive redistributions. Moreover, a well-tuned network can actually make it easier to identify when upgrades and redistributions are needed and where they should be implemented.
Tuning parameters fall into several categories relating to (1) traffic workload, e.g., whether a communications line is used heavily or comparatively little; (2) traffic patterns, e.g., the extent to which traffic consists primarily of interactive transmissions vs. batch transmissions; (3) resource consumption, e.g., parameters limiting consumption of node resources such as buffers and CPU availability ("CPU" is more precisely denoted "CCU" or central control unit) in a controller; and (4) error handling and recovery.
For example, the MAXOUT parameter relates to the fact that on SDLC links and token ring links, a message counter is assigned to every message that goes out. The MAXOUT parameter, set at system generation time for the controller's NCP load module, establishes a maximum count of messages allowed to go out to a terminal attached to the controller before an acknowledgement comes back. Referring to FIG. 1, assume for example that ten messages are queued up to be sent by a controller P2, in a specified sequence, to an attached terminal device LU and the MAXOUT parameter for the controller is seven. After seven messages are sent out, the controller P2's NCP sends an "are you there?" poll message to the terminal device LU, which responds with an identifier of the last message that it received in proper sequence; any message sent out after that last message is assumed to have been lost and is retransmitted. This gives the NCP positive confirmation of receipt, explicitly or implicitly, because a response from a terminal device that "I received message 3" implies that messages 1 and 2 arrived as well.
MAXOUT is a parameter that is set at system generation time for the NCP. It normally cannot be adjusted up or down for improved or degraded line conditions without regenerating the NCP, i.e., reinitializing the controller.
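The MAXOUT mechanism can be sketched as a small simulation. This is a simplified model and not NCP code: the function send_with_maxout and its lost_once argument are hypothetical, a poll is assumed to follow every window (including a final partial one), and a lost message is assumed to get through when it is retransmitted.

```python
def send_with_maxout(messages, maxout, lost_once=frozenset()):
    """Deliver messages in order, polling after at most `maxout` frames.

    lost_once holds 0-based positions whose first transmission is dropped.
    Returns (delivered messages, frames transmitted, polls sent).
    """
    if maxout < 1:
        raise ValueError("maxout must be at least 1")
    delivered = []                  # what the terminal accepted in sequence
    pending_loss = set(lost_once)
    frames = polls = 0
    while len(delivered) < len(messages):
        start = len(delivered)
        in_sequence = True
        for i in range(start, min(start + maxout, len(messages))):
            frames += 1
            if i in pending_loss:
                pending_loss.discard(i)   # dropped now; the retransmission succeeds
                in_sequence = False
            elif in_sequence:
                delivered.append(messages[i])
            # frames that arrive after a gap are out of sequence and are discarded
        polls += 1   # the reply identifies the last message received in sequence
    return delivered, frames, polls


msgs = list(range(1, 11))                        # ten queued messages
print(send_with_maxout(msgs, 7)[1:])             # (10, 2): ten frames, two polls
print(send_with_maxout(msgs, 7, {4})[1:])        # (13, 2): messages 5-7 are resent
```

In the second call the fifth message is lost, so the poll reply names message 4 and the next window restarts at message 5. A smaller MAXOUT would bound the number of frames resent after such a loss at the cost of more polls, which is the trade-off that cannot be revisited without regenerating the NCP.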
As another example, the parameter PASSLIM relates to a multi-drop line such as that shown in FIG. 2. In some installations an SDLC link (shown in the drawing as 37xx) will have connected to it a plurality of physical unit PU connections (shown as 3x74s), sometimes referred to colloquially as "drops." A parameter PASSLIM is used to implement "timesharing" of the network among the different drops. That parameter controls the maximum number of messages that will be sent to a particular drop before suspending the message traffic to that drop and beginning to send pending messages to another drop. That helps prevent slower drops on a multi-drop line (e.g., those attached to some printers and other batch-type devices) from tying up the line.
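The effect of PASSLIM on a multi-drop line can be sketched as a round-robin service loop. The sketch is illustrative rather than a description of the NCP's actual scheduler; the function service_order, the drop names, and the message labels are hypothetical.

```python
from collections import deque


def service_order(queues, passlim):
    """Return (drop, message) pairs, sending at most `passlim` messages per drop per pass."""
    if passlim < 1:
        raise ValueError("passlim must be at least 1")
    pending = {drop: deque(msgs) for drop, msgs in queues.items()}
    order = []
    while any(pending.values()):
        for drop, queue in pending.items():
            # Serve this drop up to the limit, then move on to the next drop.
            for _ in range(min(passlim, len(queue))):
                order.append((drop, queue.popleft()))
    return order


queues = {
    "printer": ["p1", "p2", "p3", "p4", "p5"],   # batch-type drop with a long backlog
    "terminal": ["t1", "t2"],                    # interactive drop
}
for drop, message in service_order(queues, passlim=2):
    print(drop, message)
```

With a limit of two, the interactive terminal's messages go out after only two printer messages instead of waiting behind all five.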
Still another example is the segment size parameter, which affects the size of the segments into which network message traffic is divided to fit into, e.g., the buffer size of the receiving device. The permissible segment size might increase, e.g., as device hardware is upgraded, but the segment size cannot be increased without reloading the controller NCP. That would entail "cycling" (taking off line, then returning on line) all devices associated with the controller as well as any intermediate links that were dependent on that controller. Inasmuch as availability of network links is a major practical consideration, changes of that kind are not feasible during normal operations.
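The role of the segment size can be shown with a minimal sketch that divides a message into segments no larger than a given size. The function name segment and the byte counts are assumptions for illustration; actual SNA segmenting also involves headers that are ignored here.

```python
def segment(message: bytes, segment_size: int):
    """Split a message into consecutive segments of at most `segment_size` bytes."""
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    return [message[i:i + segment_size] for i in range(0, len(message), segment_size)]


message = bytes(1000)                               # a hypothetical 1000-byte message
print([len(s) for s in segment(message, 256)])      # [256, 256, 256, 232]
print([len(s) for s in segment(message, 512)])      # [512, 488]
```

Doubling the segment size halves the number of segments in this example, which suggests why a larger permissible size is attractive once the receiving device can accept it.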
1.8 Difficulties of SNA Network Tuning
Tuning of an SNA network is a nontrivial task. It calls for knowledge of the configuration and usage patterns of the network and an understanding of the effects of the available tuning parameters. Tuning entails steps such as (1) collecting network statistical data, (2) analyzing the data to isolate any underlying problems, (3) selecting an appropriate course of action for tuning, and (4) implementing the selected tuning actions.
Equally important, tuning is a cyclic process whose usefulness depends in large part on how quickly the above-described four steps can be completed. Network activity can change significantly in minutes, but the underlying assumption of tuning--that past network performance is a useful predictor of future performance--is true only when tuning can be completed before network activity changes significantly.
Conventional tuning of an SNA network can be difficult because, among other reasons, (a) NCP parameter changes require program regeneration, reloading, and reactivation in all affected communication controllers; (b) the tuning process often takes too long to be of any real use before network conditions make the particular tuning obsolete; (c) tuning is based on past activity rather than current activity and often entails tuning to the average rather than to the high and low levels of network activity; (d) tuning requires specialized knowledge of network configurations, NCP parameters, and equipment specifications; (e) changes often cannot be made quickly enough to optimize the major interactive and batch shifts that can occur in a network over a typical 24-hour period; and (f) tuning must be constantly repeated as network configuration and work loads change. As a result, in many installations the network is tuned manually for "average" conditions but is not optimized for existing conditions at any given time.
Moreover, network tuning is never "finished." Even if a network administrator succeeds in perfectly tuning the network, the network configuration and network activity often change so quickly that re-tuning is required. For example, software upgrades and new software packages change the amount of end user activity and the load distribution. Hardware failures change network activity and the load distribution as messages are routed around the failed equipment. New hardware and hardware upgrades change the load distribution. Adding or moving end users to different points on the network changes network activity and the network load. Separate scheduling of batch and interactive sessions changes network activity. Network tuning itself changes activity and the load distribution. Some of these changes are unanticipated and unwanted, requiring additional tuning changes.
In short, SNA network tuning is very much an iterative process. The effectiveness of tuning, and thus its usefulness as a network management technique, may well depend on how quickly each iteration can be planned and completed.