The following documents are hereby incorporated by reference in their entirety:
1. Object Oriented Programming, Coad P., and Nicola J., YourDon Press Computing Series, 1993, ISBN 0-13-032616-X.
2. The C Programming Language, Kernighan B., and Ritchie D., 1st Edition, Prentice-Hall Inc., ISBN 0-13-110163-3.
3. The Unix Programming Environment, Kernighan and Pike, Prentice-Hall Inc., ISBN 013-937699-2.
4. Unix Network Programming, Stevens, Prentice Hall Software Series, 1990, ISBN 0-13-949876-1.
5. Internetworking with TCP/IP, Volume I, Principles, Protocols, and Architecture, 2d Ed, Prentice Hall, 1991, ISBN 0-13468505-9.
6. Solaris 1.1, SMCC Version A, AnswerBook for SunOS 4.1.3 and Open Windows Version 3, Sun Microsystems Computer Corporation, Part Number 704-3183-10, Revision A.
7. Artificial Intelligence, Rich E., McGraw-Hill, 1983, ISBN 0-07-052261-8.
8. Artificial Intelligence, Winston P., 2d Edition, 1984, ISBN 0-201-082594.
9. Documentation for the SunOS 4.1.3 operating system from Sun Microsystems, Inc.
10. SunOS 4.1.3 manual pages ("man pages") from Sun Microsystems, Inc.
As used within this document and its accompanying drawings and figures, the following terms are to be construed in this manner:
1. "CPU" shall refer to the central processing unit of a computer if that computer has a single processing unit. If the computer has multiple processors, the term "CPU" shall refer to all the processing units of such a system.
2. "Managing a computer" shall refer to the steps necessary to manage a computer, for example, gathering and storing information, analyzing information to detect conditions, and acting upon detected conditions.
The problem of system administration for a computer with a complex operating system such as the UNIX operating system is a complex one. For example, in the UNIX workstation market, it is common for an organization to hire one system administrator for every 20-50 workstations installed, with each such administrator costing a company (including salary and overhead) between $60,000 and $100,000. Indeed, some corporations have discovered that despite freezing or cutting back hardware and software purchases, the rising cost of retaining system administrators has nevertheless continued to escalate the cost of maintaining an Information Services organization at a substantial rate.
In a typical system administration environment, the work cycle consists of the following. A problem occurs on the computer which prevents the end user from carrying out some task. The end user detects that problem some time after it has occurred, and calls the complaint desk. The complaint desk dispatches a system administrator to diagnose and remedy the problem. This has three important consequences. First, problems are detected only after they have blocked a user's work. This can be of substantial impact in organizations which use their computers to run their businesses. Second, problems which do not necessarily block a user's work, but which may nonetheless have important consequences, are difficult to detect. For example, one vendor supplies an electronic mail package which is dependent upon a functional mail daemon process. This mail daemon process has a tendency to die on an irregular but frequent basis. In such situations, the end user typically does not realize that electronic mail cannot be received until after missing a meeting scheduled by electronic mail. Third, because problems are not detected until after they block a user's work, a problem which at an earlier stage might have been easier to fix cannot be fixed until it has escalated into something more serious and more difficult to correct.
Currently, system administrators manage a group of computers by performing most actions manually. Typically, the system administrator periodically issues a variety of commands to gather information regarding the state of the various computers in the group. Based upon the information gathered, and based upon a variety of non-computer information, the system administrator detects problems and formulates action plans to deal with the detected problems.
Automation of a system administrator's tasks is difficult for several reasons:
1. Data regarding the state of the computer is difficult to obtain. Typically, the system administrator must issue a variety of commands and consider several pieces of information from each command in order to diagnose a problem. If the system administrator is responsible for several machines, these commands must be repeated on each machine.
2. When the system administrator detects a problem, the appropriate action plan may vary depending on a variety of external factors. For example, suppose a particular computer becomes slow and unresponsive when the system load on that computer crosses a certain threshold. If this problem occurs during normal business hours under ordinary circumstances, it will probably be a problem which must be resolved in a timely manner. On the other hand, suppose this problem occurs in the middle of the night. While this situation might still be a problem, the resolution need not be as timely, since the organization's work will not be impacted unless the problem still exists by the start of the business day. Now suppose the accounting department, at the end of each month, runs a processor-intensive task to do the end-of-month accounting, which normally forces the load average above that threshold. If the system load crosses that same threshold during the time when the accounting department runs its end-of-month program, that is not a problem. Building a tool to handle situations like these using current tools would require writing a large series of inter-related complex boolean expressions. Unfortunately, writing and testing such a series of complex boolean expressions is difficult.
3. Current system administration tools view the universe of computer problems as a static universe. Computer problems, however, evolve over time as hardware and software are added, removed, and replaced in a computer.
4. Furthermore, an automated tool should also flexibly alter its behavior based on the nature of the commands a system administrator issues to it in guiding it to resolve problems. Thus, if the system administrator routinely ignores a particular problem, the automated tool should warn the system administrator less frequently if the routinely ignored problem recurs.
What is needed is a tool which will automatically gather the necessary computer information to manage a group of computers, detect problems based upon the gathered information, inform the system administrator of detected problems, and automatically perform corrective actions to resolve detected problems.
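By way of illustration only, the load-threshold scenario discussed in this background (business hours, the middle of the night, and the end-of-month accounting run) might be sketched as follows. This is a minimal sketch and not the method of any existing tool. The threshold value, the business-hour boundaries, the assumption that the accounting task runs on the last calendar day of the month, and every name in the code (classify_load, LOAD_THRESHOLD, and so on) are hypothetical and chosen only for this example.

```python
import calendar
from datetime import datetime

# All constants below are assumptions made for illustration.
LOAD_THRESHOLD = 5.0        # hypothetical load average above which the machine is slow
BUSINESS_START_HOUR = 9     # assumed start of the business day (09:00)
BUSINESS_END_HOUR = 17      # assumed end of the business day (17:00)


def is_month_end(when):
    """Assume the accounting task runs on the last calendar day of the month."""
    last_day = calendar.monthrange(when.year, when.month)[1]
    return when.day == last_day


def in_business_hours(when):
    """True if the sample was taken during the assumed business hours."""
    return BUSINESS_START_HOUR <= when.hour < BUSINESS_END_HOUR


def classify_load(load_average, when):
    """Classify one load-average sample taken at time `when`."""
    if load_average <= LOAD_THRESHOLD:
        return "ok"          # threshold not crossed
    if is_month_end(when):
        return "expected"    # high load caused by the accounting run
    if in_business_hours(when):
        return "urgent"      # users are being impacted now
    return "deferred"        # overnight: resolve before the business day starts


if __name__ == "__main__":
    samples = [
        (3.0, datetime(1994, 3, 15, 10)),
        (8.0, datetime(1994, 3, 15, 10)),
        (8.0, datetime(1994, 3, 15, 2)),
        (8.0, datetime(1994, 3, 31, 10)),
    ]
    for load, when in samples:
        print(when.isoformat(), load, classify_load(load, when))
```

Even with only three external factors, the classification already depends on the order in which the conditions are tested. Because each sample is classified independently, an overnight problem that persists is reclassified as urgent once a later sample falls within business hours. Each additional factor multiplies the combinations that must be written and tested, which is the difficulty described above.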