Achieving efficient processing in the day-to-day operation of a large-scale software system has conventionally required staff with the expertise to interpret indicators of system performance, and has consumed a significant amount of staff time and financial resources to identify actual and potential problem areas.
System performance is generally gauged by two very different measures, known respectively as system response time and system throughput. The former measure relates to the speed of responding to a single system command, whereas the latter relates to the efficiency of processing large amounts of data. Balancing these measures is the domain of the system expert, and overall efficiency depends on carefully arranging both the hardware and software that do the processing and the information that is stored in the system's databases. The procedure of allocating resources so that system processing is shared by the resources on a balanced basis is called "tuning."
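The two measures can be made concrete with a minimal sketch that computes both from the same set of observations. The `Transaction` record, its field names, and the sample timings are assumptions made purely for illustration; they are not drawn from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """Hypothetical record of one system command (illustrative only)."""
    start: float  # seconds, when the command was submitted
    end: float    # seconds, when the response was returned


def mean_response_time(txns):
    """Average time taken to respond to a single command, in seconds."""
    if not txns:
        raise ValueError("at least one transaction is required")
    return sum(t.end - t.start for t in txns) / len(txns)


def throughput(txns):
    """Completed transactions per second over the observed window."""
    if not txns:
        raise ValueError("at least one transaction is required")
    window = max(t.end for t in txns) - min(t.start for t in txns)
    if window <= 0:
        raise ValueError("observation window must have positive length")
    return len(txns) / window


# Assumed sample data: four transactions observed over four seconds.
sample = [
    Transaction(0.0, 2.0),
    Transaction(1.0, 2.0),
    Transaction(2.0, 4.0),
    Transaction(3.0, 4.0),
]

print(mean_response_time(sample))  # seconds per transaction
print(throughput(sample))          # transactions per second
```

The same observations yield two different figures: one describes what a single user experiences, and the other describes how much work the system completes overall. This is why the two measures can move independently.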
Tuning of this type, namely, allocation of resources by a human system expert, is still carried out despite the advent of computerized, autonomous resource managers designed to optimize performance. These computerized managers are typically part of the operating system software and are designed to service general workloads which, as a rule, are not backlogged. Representative of such computerized managers of the non-backlogged type is the subject matter of the disclosure of Watanabe et al. U.S. Pat. No. 4,890,227. For the non-backlogged case, it is possible to optimize both response time and throughput. For the backlogged case, improvements in response time come at the expense of throughput, and vice versa. However, these managers operate under the handicap that the current allocation of files to disks is outside the scope of their optimization. Also, of necessity, these managers lack application-specific knowledge. This limitation precludes them from including in their optimizations other, non-performance-related concerns. A prime example of one such concern is allocating files to disks so as to minimize the impact on the system as a whole in the event that a particular disk fails. This concern is referred to here as developing a damage limitation policy, and its implementation has a direct bearing on system availability. Finally, these managers do not attempt to regulate the flow rate of transactions into the system. Consequently, their optimization is local, not global.
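One way to reason about a damage limitation policy is to score a file-to-disk allocation by how much of the system a single disk failure would touch. The sketch below is only an illustration under stated assumptions: the text does not define how impact is measured, so "impact" is taken here to be the number of distinct applications that lose at least one file, and all file, disk, and application names are hypothetical.

```python
from collections import defaultdict


def failure_impact(allocation, owner):
    """Map each disk to the set of applications that lose a file if it fails.

    allocation: dict of file name -> disk name (hypothetical names)
    owner:      dict of file name -> owning application (hypothetical names)
    """
    impact = defaultdict(set)
    for file_name, disk in allocation.items():
        impact[disk].add(owner[file_name])
    return dict(impact)


def worst_case_impact(allocation, owner):
    """Largest number of applications affected by any single disk failure.

    Assumption: impact is counted in affected applications; other measures
    (transactions lost, recovery time) would need different scoring.
    """
    impact = failure_impact(allocation, owner)
    return max((len(apps) for apps in impact.values()), default=0)


owner = {"a1": "orders", "a2": "orders", "b1": "inventory", "b2": "inventory"}

# Files of both applications are interleaved across the two disks.
mixed = {"a1": "disk0", "b1": "disk0", "a2": "disk1", "b2": "disk1"}

# Each application's files are confined to a single disk.
grouped = {"a1": "disk0", "a2": "disk0", "b1": "disk1", "b2": "disk1"}

print(worst_case_impact(mixed, owner))
print(worst_case_impact(grouped, owner))
```

Under this assumed measure, the grouped allocation limits the damage of a single failure to one application, whereas the interleaved allocation exposes both. Grouping may also concentrate I/O load on fewer disks, which suggests why such a concern has to be weighed against performance rather than optimized in isolation.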
To place the response time and system throughput measures and their ramifications in a practical setting, reference is made to an illustrative example of a large-scale software system, designated the FACS system, which finds widespread use in the telecommunications environment. The FACS system assigns and inventories telephone outside plant equipment (e.g., cable pairs) and central office facilities (e.g., cable appearances on a main distribution frame). Currently, the FACS system is embodied in approximately 1 million lines of source code. It runs on a mainframe or host computer composed of a CPU complex with 2-4 processors, an I/O system containing 6-8 dual disk controllers, and 60-80 disks of 600 million bytes each. Because of the complexity and size of the FACS system, as well as its sophisticated execution environment, operating the FACS system with acceptable response time while maintaining high system throughput is an ongoing challenge that requires the tuning skills of an expert to achieve a high level of system performance.
Formulating a thorough diagnosis and just one possible remedy for performance problems in such a large system typically takes expert analysts several days. Starting with performance symptoms, analysts manually tuning a FACS system first deduce which of several kinds of data they need in order to analyze the problems. Then they apply formulas and guidelines based on their own experience to arrive at a basic understanding of the problem areas; for instance, transactions occasionally stack up in queues, leading to inefficient use of central-processing resources. Next, the analysts cull the data, searching for specific explanations for the degradation of performance. The final step, identifying solutions, again calls for so much knowledge and data that shortcuts based on past experience are a practical necessity. Of course, once changes are made, another cycle of analysis must be undertaken to verify that the problems have been corrected. Because the analysis is so time-consuming and difficult, performance issues are often addressed only after system performance has degraded.
When systems of this size go awry, there are typically many symptoms to analyze. It is difficult to isolate those that truly affect performance from those that merely appear to do so. Culling the important symptoms and then synthesizing an assessment of the current state of the system requires an understanding of how a symptom (such as a large number of concurrently active processes) affects the users' perception of the responsiveness of the system as a whole. Developing this view requires deep analysis, facilitated by the mathematics of queueing theory. The analysis techniques themselves are difficult to understand, and properly interpreting the results obtained from them requires insight into the dynamics of the underlying system.
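As one elementary illustration of how queueing theory links such a symptom to perceived responsiveness, Little's law states that, for a stable system in steady state, the mean number of transactions in the system equals the throughput multiplied by the mean response time. The sketch below applies that relation; the text does not say which queueing results analysts actually rely on, and the function name and sample figures are assumptions chosen for illustration.

```python
def response_time_from_littles_law(mean_active, throughput_per_sec):
    """Estimate mean response time (seconds) using Little's law, N = X * R.

    mean_active:        mean number of concurrently active transactions (N)
    throughput_per_sec: completed transactions per second (X)

    Assumes a stable system observed in steady state; the estimate is not
    meaningful while queues are still growing.
    """
    if mean_active < 0:
        raise ValueError("mean_active must be non-negative")
    if throughput_per_sec <= 0:
        raise ValueError("throughput_per_sec must be positive")
    return mean_active / throughput_per_sec


# Assumed figures: concurrency doubles while throughput stays flat.
before = response_time_from_littles_law(mean_active=40, throughput_per_sec=8)
after = response_time_from_littles_law(mean_active=80, throughput_per_sec=8)

print(before, after)
```

Read this way, a rise in concurrently active processes with no matching rise in throughput implies a proportionally longer mean response time, which is the effect the users perceive. A rise in both by the same factor would leave responsiveness unchanged, so the symptom alone does not settle the diagnosis.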