The data processing resources of business organizations are increasingly taking the form of a distributed computing environment in which data and processing are dispersed over a network comprising many interconnected, heterogeneous and geographically remote computers. Among the reasons for this approach are to offload non-mission-critical processing from the mainframe, to provide a pragmatic alternative to centralized corporate databases, to establish a single computing environment, to move control into the operating divisions of the company, and to avoid having a single point of failure. For example, many business entities have one client/server network installed in each regional office, in which a high-capacity computer system operates as the "server" supporting many lower-capacity "client" desktop computers. The servers in such a business entity are also commonly connected to one another by a higher-level network known as a wide area network. In this manner, users at any location within the business entity can theoretically access resources available in the company's network regardless of where the resource is located.
The flexibility gained for users with this type of arrangement comes with a price, however. It is very difficult to manage such a diverse and widely-dispersed network for many reasons. Servers installed in the wide area network are frequently not all of the same variety. One regional office may be using an IBM machine with a UNIX operating system, while another regional office may be using a DEC machine with a VMS operating system. Also, applications present on the servers throughout the network vary not only in terms of type, but also product release level within an application type. Moreover, the applications available are changed frequently by users throughout the network, and failure events in such a network are usually difficult to catch until after a failure has already occurred. Thus, a need exists for an efficient and flexible enterprise management system.
By way of background, one computer network management system was implemented in the fashion shown schematically in FIG. 1. In FIG. 1, a network management computer system 10 is coupled via network 12 to server computer system 14 and a plurality of other server computer systems. The hardware present in each of the computer systems may be of any conventional type such as is typically found on server computers in a client/server network environment. Moreover, the hardware configuration of each of the computer systems need not be the same. For example, network management computer system 10 might be built around a computer sold by International Business Machines Corporation operating with the well-known UNIX operating system, while server computer system 14 might be built around a computer sold by Digital Equipment Corporation operating with the well-known VMS operating system. The other server computer systems in the network might be built around yet other hardware/software platforms. In addition, all of the server computers in the network might be coupled to a variety of supported client computers such as desk-top computers, workstations and other resources. It is anticipated, however, in FIG. 1 that network management computer system 10 and each of the server computer systems in the network will be equipped with some sort of CPU 16, 18, some sort of conventional input/output equipment 20, 22 such as a keyboard and a display monitor, some sort of conventional data storage device 24, 26 such as a disk or tape drive or CD ROM drive, some sort of random access memory ("RAM") 28, 29, and some sort of conventional network communication hardware 30, 32 such as an ETHERNET interface unit for physically coupling the computer system to network 12. In the system of FIG. 1, network 12 may be implemented using any conventional network protocol such as TCP/IP.
In the configuration shown in FIG. 1, a manager software system 34 is stored on storage device 24 in network management computer system 10; one agent software system is installed on each of the server computer systems in the network, such as agent software system 36 shown stored on storage device 26 in server computer system 14; at least one knowledge module 38 is stored on storage device 24 in network management computer system 10; and at least one script program 40, 42 is stored on each of the storage devices 24, 26 throughout the computer network.
FIG. 2 illustrates the main components for implementing the manager software system 34 shown in the system of FIG. 1. Knowledge module parser 44 is responsible for accessing knowledge module 38 and parsing the information therein for use by knowledge database manager 46, which in turn creates and maintains a database 47 of knowledge that is more readily useable by manager software system 34 than would be the data stored in knowledge module 38. Object database manager 48 creates and maintains a database 49 representing all of the resources and applications (collectively, "objects") present on the computer network, as well as information pertaining to the state of those objects, in a form that will be readily useable by a graphical user interface module 50. Databases 47 and 49 may be stored in RAM or on a storage device such as a hard disk. Graphical user interface 50 is responsible for communicating with display driver software in order to present visual representations of objects on the display of network management computer system 10. Such representations typically take the form of icons for objects. Also, graphical user interface module 50 coordinates the representation of pop-up windows for command menus and the display of requested or monitored data. Event manager 52 is responsible for keeping a record of various occurrences throughout the computer network, such as the occurrence of alarm conditions and their resolution, for the purpose of record keeping and management convenience. Interface 54 is for the purpose of interfacing with network management software other than the manager software system 34 and agent software system 36. For example, users of network management computer system 10 may make use of software such as Hewlett Packard Corporation's OPENVIEW product for the purpose of monitoring low-level network conditions such as broken physical connections. 
While using such a third-party product, the user may open a window and request information from manager software system 34, in which case interface 54 will coordinate communication between manager software system 34 and such third party product. Communications module 56 is responsible for handling all communications to and from agent software systems installed throughout the computer network. Script program compiler 58 is used when the user of manager software system 34 wishes to develop script programs for use in customizing the network management system. Kernel 60 represents all other miscellaneous functions within manager software system 34, such as coordinating the action of the above-named modules and the communications between them.
FIG. 3 illustrates the main components of the agent software system 36 shown in FIG. 1. Communications module 62 coordinates message communications to and from other computers, such as network management computer system 10, and parses the information contained in such messages. Script program compiler 64 is responsible for compiling script programs. Such compilation is only partial, however, resulting in an intermediate code that is not directly executable, but that is interpretable by script program interpreter 66. Command execution manager 68 is responsible for coordinating the execution of commands dictated from within agent software system 36 by any of its components. Depending on the command type, executions of such commands may entail the use of operating system commands available on the host server computer, or such commands may entail the interpretation of script programs as will be further described below. Run queue scheduler 70 maintains a list of runnable jobs or commands, together with the times at which they should be run and their desired frequency. By checking a timer within agent software system 36, run queue scheduler 70 is capable of "waking up" at appropriate times to route runnable jobs or commands to command execution manager 68. Dispatcher 72 is responsible for routing information to and from the appropriate modules within agent software system 36, and generally performs a coordinating function similar in nature to that of kernel 60 in manager software system 34. Knowledge database manager 74 creates and maintains a database 75, either in RAM or on a storage device such as a hard disk, containing knowledge received via messages from manager software system 34. The knowledge maintained in agent's database 75 differs from the knowledge contained in manager's database 47, however, in that agent's database 75 typically does not contain information pertinent to the display of information on the manager's console.
Process cache manager 76 creates and maintains process cache 77, which is typically stored in RAM. Agent software system 36 fills process cache 77 periodically with information concerning the processes that are present on the host server computer at any given moment. Process cache 77 is also accessed by other modules within agent software system 36, such as application discovery manager 78, for providing some of the input information used to determine whether certain resources are present on the host server. Parameter and recovery action manager 80 is responsible for monitoring certain aspects of resources on the server computer, such as "disk space remaining," for example, and is responsible for taking automatic actions to recover from alarm levels for such resources, as will be discussed below.
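By way of illustration only, the behavior described above for run queue scheduler 70 (a list of runnable jobs, the times at which they should run, and their desired frequency) might be modeled in Python as a priority queue keyed on next run time. This is a sketch under stated assumptions, not the code of the system of FIG. 1: the names RunQueue, add_job and run_due, the caller-supplied clock value, and the choice to reschedule relative to the current time are all hypothetical.

```python
import heapq
import itertools


class RunQueue:
    """Hypothetical model of a run queue: each job has a next-run time and a period."""

    def __init__(self):
        self._heap = []                # entries: (next_run, seq, period, command)
        self._seq = itertools.count()  # tie-breaker so equal times pop in insertion order

    def add_job(self, command, first_run, period):
        """Register a command to run at `first_run` and every `period` units after."""
        if period <= 0:
            raise ValueError("period must be positive")
        heapq.heappush(self._heap, (first_run, next(self._seq), period, command))

    def run_due(self, now, execute):
        """Route every job whose time has arrived to `execute`, then reschedule it.

        Rescheduling is relative to `now` (an assumption of this sketch), so a job
        runs at most once per call even if the scheduler woke up late.
        """
        ran = []
        while self._heap and self._heap[0][0] <= now:
            _, _, period, command = heapq.heappop(self._heap)
            execute(command)
            ran.append(command)
            heapq.heappush(self._heap, (now + period, next(self._seq), period, command))
        return ran
```

In this model, the `execute` callback plays the part of command execution manager 68, and the caller supplies `now` in place of the timer within agent software system 36.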
FIG. 4 is a diagrammatic illustration of the types of information that are typically stored in a knowledge module 38 and in knowledge databases such as databases 47 and 75. (Note that knowledge module 38 is usually stored in the form of a data file containing ASCII text.) There are two basic broad categories of information represented in a knowledge module. Category 92 comprises information related to computers that may be present on any given network. Category 92 includes information in categories 82, 84 and 86. Category 94 comprises information related to applications that might be present on the computers in any given network. Category 94 includes information in categories 88 and 90. As can be seen in categories 82 and 88, various types of information may be stored in a knowledge module, such as information relating to environment, parameters, command types, commands, setup commands, "infobox" commands, and discovery. For example, environment information includes values for environment variables that are used to execute certain commands. Parameter information pertains to certain aspects of a computer or application that are to be monitored, such as "number of users logged in." "Command type" information tells an agent software system how to execute a given command. ("Command type" information might indicate that a given command is type "operating system," or type SQL, or that the command is actually a script program.) "Command" information, proper, is associated with the definition of a command, i.e., the text of the actual command, and contains information displayed in a command menu at the network manager's console. Setup commands are those that are to be executed whenever the manager software system 34 establishes a connection with an agent software system 36. Infobox command information relates to the format for displaying command output in "pop-up" information windows at the manager's console.
Discovery information relates to which application classes are desired to be searched for, and also to the names and locations of the script programs required to do the searching.
Note that, in knowledge module 38, the above categories of information are arranged in a hierarchy, such that information in category 82 will apply to all computers (for example, IBM and DEC computers), unless overridden by information in category 84 or 86. By the same token, information in category 84 would apply to all instances of a given class of computers (for example, all computers using the UNIX operating system), unless overridden by information in category 86. Information in category 86 would apply only to certain instances of computers in a given class. (For example, the UNIX computers at the Dallas and Houston nodes in a wide area network would represent two different instances within the UNIX computer class.) Similarly, categories 88 and 90 represent a hierarchy of information: Information in category 88 would apply to all applications in a given class of applications, unless overridden by information in category 90 pertaining to a specific application instance within the class. (For example, one application class might contain information relating to all instances of version 7 of Oracle Corporation's ORACLE database management system, while another class might contain information relating to all instances of version 6 of that company's database management system.) Information in category 90 would apply only to certain instances of the applications in a class, for example the ORACLE version 7 database present on a certain server computer system within the network.
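The override hierarchy just described for categories 82, 84 and 86 can be sketched, purely for illustration, as a layered lookup in which the most specific level that defines a value wins. The example below uses Python's standard `collections.ChainMap`; the keys and values shown (`poll_interval`, `shell`) and the function name `effective_knowledge` are hypothetical and do not come from knowledge module 38.

```python
from collections import ChainMap

# Hypothetical contents for the three computer-related levels.
all_computers = {"poll_interval": 300, "shell": "/bin/sh"}  # category 82: all computers
unix_class = {"poll_interval": 120}                          # category 84: one computer class
dallas_instance = {"shell": "/bin/ksh"}                      # category 86: one instance


def effective_knowledge(instance, computer_class, all_level):
    """Return a view in which instance values override class values,
    and class values override values that apply to all computers."""
    return ChainMap(instance, computer_class, all_level)


dallas = effective_knowledge(dallas_instance, unix_class, all_computers)
houston = effective_knowledge({}, unix_class, all_computers)  # no instance overrides

print(dallas["poll_interval"], dallas["shell"])    # class-level and instance-level values
print(houston["poll_interval"], houston["shell"])  # class-level and all-computers values
```

The same layering would apply to the application categories 88 and 90, with two levels instead of three.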
It should be noted that, in the network management system of FIG. 1, only information types pertinent to a particular server are sent by manager software system 34 to the agent software system 36 installed on that server, but such pertinent information might include information from all of the above categories.
FIG. 5a, which is continued in FIG. 5b, is an excerpt from an actual knowledge module 38. FIG. 6a, which is continued in FIG. 6b, is an excerpt from an actual script program such as would be typical for script programs 40 and 42. Script programs are written in an interpretable language. In the network management system of FIG. 1, script programs are stored in network management computer system 10 and server computer system 14 in their uninterpreted form, usually in the form of an ASCII text file. In the network management system of FIG. 1, when a script program 42 is used for the first time by agent software system 36, it is compiled and interpreted. Thereafter, the compiled version of script program 42 is stored so that the next time it is required it may simply be interpreted from its intermediate form rather than being compiled again. As can be seen from the example, a script program written in an interpretable language can be used to define a command or routine, such as (in this example) a routine for collecting information and determining the number of users logged into a particular server computer system 14 as well as the number of processes per user. Any high-level language definition could be used to write the script programs for use in the system of FIG. 1, provided that the language definition enabled the programmer to: (1) execute external commands, (2) access system files, (3) communicate information about the existence and status of resources, (4) exchange information between processes, and (5) query and update a knowledge database such as databases 47 and 75.
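The compile-once behavior described above, in which a script program is compiled to an intermediate form on first use and thereafter interpreted from the stored form, can be sketched as a small cache. The sketch below is an analogy only: it uses Python's built-in `compile` and `exec` as stand-ins for script program compiler 64 and script program interpreter 66, and the class name ScriptCache, the method `run`, and the sample script text are hypothetical.

```python
class ScriptCache:
    """Hypothetical cache: compile a script on first use, reuse the result afterwards."""

    def __init__(self):
        self._compiled = {}     # script name -> intermediate (code object) form
        self.compile_count = 0  # how many times compilation actually happened

    def run(self, name, source, env=None):
        """Interpret the named script, compiling it only if no stored form exists."""
        code = self._compiled.get(name)
        if code is None:
            code = compile(source, name, "exec")  # partial compilation to intermediate code
            self._compiled[name] = code
            self.compile_count += 1
        namespace = dict(env or {})               # fresh variables for each run
        exec(code, namespace)                     # interpretation of the intermediate form
        return namespace


cache = ScriptCache()
script = "users = len(set(logins))"  # stand-in for a 'number of users logged in' routine
first = cache.run("count_users", script, {"logins": ["ann", "bob", "ann"]})
second = cache.run("count_users", script, {"logins": ["ann", "bob", "cy"]})
```

After both calls, `cache.compile_count` remains 1, since the second call reuses the stored intermediate form rather than compiling again.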
FIG. 7 is a flow diagram showing how the network management system of FIG. 1 was initialized; FIG. 8 is a flow diagram illustrating how the network management system of FIG. 1 was used to discover resources on a server computer system; FIG. 9 is a flow diagram illustrating how the network management system of FIG. 1 was used to monitor resources on a server computer system; and FIG. 10 is a flow diagram illustrating how the network management system of FIG. 1 was used to execute recovery actions relating to the resources on a server computer system.
While the above-described network management system successfully addressed numerous important problems in the art, it did not address certain other problems. One such problem is that of scalability. It is desirable in a large network to use numerous network management computer systems 10, each running its own manager software 34 or "console" process, and to have agent processes in the network numbering in the thousands. In network systems like that shown in FIG. 1, however, a separate agent process is required in server computer system 14 with its own knowledge database 75 each time a new manager software process or "console" process begins to monitor the resources on server computer system 14. Therefore, multiple agents would exist on the same server in order to support multiple consoles. This soon begins to tax the memory and CPU resources of server computer system 14, decreasing the server capacity available for other applications.
Additionally, agent software system 36 in the system of FIG. 1 is dependent upon manager software 34 in at least two senses. First, knowledge must be transmitted by manager 34 to agent 36 when manager 34 desires to begin monitoring resources on server computer system 14, resulting in a large flurry of network traffic. Second, if no manager or console process exists to support agent software system 36, then resources on server computer system 14 will go un-monitored.
Another class of network management systems has been implemented according to the well-known Simple Network Management Protocol (hereinafter "SNMP") as described, for example, in Marshall T. Rose, The Simple Book (2d ed., PTR Prentice-Hall, Inc., 1994). The SNMP protocol specifies that only one agent will exist on a given managed node in a network regardless of the number of console processes interested in monitoring the resources associated with the node. The SNMP protocol is designed such that a set of information called a Management Information Base (hereinafter "MIB") will be locally available in storage for each such agent in the network. The MIB acts to define the objects, or resources, that can be monitored using the SNMP protocol. In operation, an SNMP agent will monitor objects associated with its node in accordance with the information comprising the MIB independently of the existence of a console process interested in the objects. However, an SNMP system is inefficient and inflexible in that a console must request information from the agent about objects on a piecemeal basis, one request per piece of information, causing increased network traffic as well as overhead in the computer system running the console.
Yet another problem with network management systems has been the inefficient or nonexistent means of managing events occurring within the network, resulting in difficulty in coordinating recovery actions between the various management consoles throughout the network.
It is therefore an object of the present invention to provide an enterprise management system that will increase automation and efficiency in network management and decrease the complexity of such management.
It is another object of the present invention to provide an enterprise management system that is easy to implement and maintain as installed applications and computers change.
It is another object of the present invention to provide an agent system for use in an enterprise management system wherein the agent system utilizes the memory and CPU resources of a server computer system in an efficient manner, regardless of the number of console systems that are monitoring the resources on the server.
It is another object of the present invention to provide an enterprise management system that decreases the amount of network traffic associated with communication between agent processes and console processes.
It is another object of the present invention to provide an enterprise management system that enables the management of events in a network to be coordinated between the various console processes in the network.
It is another object of the present invention to provide an agent system for use in an enterprise management system wherein the agent system is autonomous and capable of monitoring and managing the resources on a server computer system regardless of whether a console system in the network is monitoring resources on the server.
Other objects and advantages of the present invention will be apparent to persons having ordinary skill in the art and having reference to the following specification and drawings.