1. Technical Field
The present invention is directed to managing a large distributed computer enterprise environment and, more particularly, to diagnosing and correcting network faults in such an environment using mobile software agents.
2. Description of the Related Art
Today, companies desire to place all of their computing resources on the company network. To this end, it is known to connect computers in a large, geographically-dispersed network environment and to manage such an environment in a distributed manner. One such management framework consists of a server that manages a number of nodes, each of which has a local object database that stores object data specific to the local node. Each managed node typically includes a management framework, comprising a number of management routines, that is capable of a relatively large number (e.g., hundreds) of simultaneous network connections to remote machines. The framework manages hundreds of megabytes of local storage and can spawn many dozens of simultaneous processes to handle method requests from local or remote users. This amount of power, however, is quite costly. Each managed node requires upwards of a megabyte of local memory or disk plus a permanent TCP/IP connection. If a managed node sees heavy use, then such costs go up considerably. Moreover, as the number of managed nodes increases, the system maintenance problems also increase, as do the odds of a machine failure or other fault.
The problem is exacerbated in a typical enterprise as the number of nodes rises. Of these nodes, only a small percentage are file servers, name servers, database servers, or anything but end-of-wire or “endpoint” machines. The majority of the network machines are simple personal computers (“PCs”) or workstations that see little management activity during a normal day. Nevertheless, the management routines on these machines are constantly poised, ready to handle dozens of simultaneous method invocations from dozens of widespread locations, invocations that rarely occur.
Moreover, the problem of keeping a distributed management framework connected is a continuous job. Any number of everyday actions can sever a connection or otherwise contribute to a fault condition. As a result, in large, distributed computer networks such as described above, network problems are complicated and difficult to diagnose. Although certain “clues” may be present that would lead a skilled technician or expert program to arrive at a list of probable causes for the failure, it is often quite difficult to determine where the fault originates. Moreover, even when the fault location and its cause are identified with certainty, it then becomes necessary for a system administrator to manually correct the fault or to dispatch others to the location for this purpose.
It would be a significant advantage to provide some automatic means of diagnosing and correcting network problems in this type of computer environment. The present invention addresses this important problem.
It is a primary object of this invention to automatically diagnose faults or other events that occur in a large, distributed computer network.
It is another primary object of this invention to deploy a software “agent” into a distributed computer network environment to diagnose and, if possible, correct a fault.
It is yet another object of this invention to select a given software agent from a set of such agents based on a particular fault and to dispatch the selected agent into the network to locate and correct the fault.
Yet another object of this invention is to automate the diagnosis of network events in a large, distributed computing network.
A still further object of the invention is to dispatch, into a large distributed computer network, the minimum amount of code that may be necessary to rectify a given network fault.
Another object of this invention is to deploy a self-routing software agent into a distributed computer network to locate and correct a network fault or to address some other network event. Preferably, the software agent is a minimum set of tasks that are identified for use in diagnosing and/or correcting the fault.
Yet another object of the present invention is to collect information about network conditions as mobile software agents are dispatched and migrated throughout a large computer network to correct network faults, wherein such information is then useful in diagnosing new faults.
These and other objects of the invention are provided in a method of diagnosing a given event (e.g., a fault, an alarm, or the like) in a large, distributed computer network in which a management infrastructure is supported. The management infrastructure includes a dispatch mechanism preferably located at a central location, and a runtime environment supported on given nodes of the network. In particular, the runtime environment (e.g., an engine) is preferably part of a distributed framework supported on each managed node of the distributed enterprise environment. The method begins upon a given event. In response, the dispatch mechanism selects a software “agent”, preferably from a set of software agents useful in diagnosing network events. Alternatively, the dispatch mechanism creates the software agent by assembling a set of one or more tasks. The software agent selected or created is preferably a set of tasks that are selected or assembled based on the nature of the given event. Thus, the particular triggering “event” is used to provide clues as to the network location to which the agent should be sent, as well as the type of agent to send.
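By way of illustration only, the selection step described above might be sketched as follows. This is a minimal sketch and not the disclosed implementation: the names Event, Agent, AGENT_LIBRARY, TASK_LIBRARY, dispatch, and the three example tasks are all hypothetical, and the rule of using the reporting node as the first destination is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical: a task takes a node name and returns a short report string.
Task = Callable[[str], str]


@dataclass
class Event:
    kind: str          # e.g. "fault" or "alarm"
    reported_by: str   # node that raised the event (a clue to the destination)


@dataclass
class Agent:
    name: str
    tasks: List[Task]


# Hypothetical example tasks; real tasks would act on the node.
def ping_task(node: str) -> str:
    return f"ping {node}"


def restart_task(node: str) -> str:
    return f"restart service on {node}"


def log_task(node: str) -> str:
    return f"collect logs from {node}"


# Assumed stores held by the dispatch mechanism: prebuilt agents, and
# loose tasks from which an agent can be assembled on demand.
AGENT_LIBRARY: Dict[str, Agent] = {
    "fault": Agent("fault-fixer", [ping_task, restart_task]),
}
TASK_LIBRARY: Dict[str, List[Task]] = {
    "alarm": [ping_task, log_task],
}


def dispatch(event: Event) -> Tuple[Agent, str]:
    """Pick a prebuilt agent for the event, or assemble one from tasks.

    Returns the agent and the node it should be sent to first.
    """
    agent = AGENT_LIBRARY.get(event.kind)
    if agent is None:
        # No prebuilt agent: assemble the smallest task set known for
        # this kind of event, falling back to log collection only.
        tasks = TASK_LIBRARY.get(event.kind, [log_task])
        agent = Agent(f"assembled-{event.kind}", list(tasks))
    return agent, event.reported_by
```

In this sketch, a "fault" event yields the prebuilt agent, while an "alarm" event yields an agent assembled from two tasks, which mirrors the select-or-create alternative in the text.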
The software agent is then deployed into the computer network, for example, to determine a cause and location of the given event. When the software agent is received at a given node, the method determines whether the event originated from the node. If so, the software agent identifies the cause and, if possible, undertakes a corrective or other action depending on the nature of the event in question. Thus, for example, if the event were a fault, the software agent attempts to correct the fault. If, however, the event originated elsewhere, the software agent identifies a subset of nodes in the computer network that remain possible candidates for origination of the event. Preferably, the software agent then replicates itself to create a new instance. This new instance is then launched to the identified subset to continue searching for the location and cause of the event. At each node, this process is repeated until the location is identified.
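The search just described can be simulated in a few lines. The following is a sketch under stated assumptions rather than the disclosed method: Node, SearchAgent, and run_agent are hypothetical names, the network is an in-memory dictionary, the "subset of candidate nodes" is taken to be the unvisited neighbors of the current node, the corrective action is reduced to clearing a flag, and the replicas are run one after another instead of concurrently.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    name: str
    neighbors: List[str]
    faulty: bool = False   # stand-in for "the event originated here"


@dataclass
class SearchAgent:
    visited: Set[str] = field(default_factory=set)

    def replicate(self) -> "SearchAgent":
        # A new instance carrying a copy of what this agent has learned.
        return copy.deepcopy(self)


def run_agent(agent: SearchAgent, node_name: str,
              network: Dict[str, Node]) -> Optional[str]:
    """Run the agent at one node; return the fault location if found."""
    node = network[node_name]
    agent.visited.add(node_name)

    if node.faulty:
        # The event originated at this node: take the corrective action.
        node.faulty = False
        return node_name

    # Otherwise narrow the search to nodes that remain candidates
    # (assumption: neighbors this agent has not yet visited).
    candidates = [n for n in node.neighbors if n not in agent.visited]
    for next_name in candidates:
        clone = agent.replicate()              # new instance of the agent
        found = run_agent(clone, next_name, network)
        if found is not None:
            return found
    return None


# Usage: a small network in which node "c" holds the fault.
network = {
    "hub": Node("hub", ["a", "b"]),
    "a": Node("a", ["hub"]),
    "b": Node("b", ["hub", "c"]),
    "c": Node("c", ["b"], faulty=True),
}
location = run_agent(SearchAgent(), "hub", network)
```

Because each replica carries its own copy of the visited set, no replica returns to a node on its own path, so the simulated search terminates even when the topology contains cycles.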
At each node, the software agent is preferably run by the runtime engine previously deployed there. Alternatively, the software agent runs as a standalone process using existing local resources. As noted above, when the event is a fault, the software agent locates the fault and attempts to rectify it. If necessary, the software agent may obtain additional code from the dispatch mechanism or some other network source. Such additional code may be another software agent.
The foregoing has outlined some of the more pertinent objects of the present invention. These objects should be construed to be merely illustrative of some of the more prominent features and applications of the invention. Many other beneficial results can be attained by applying the disclosed invention in a different manner or modifying the invention as will be described. Accordingly, other objects and a fuller understanding of the invention may be had by referring to the following Detailed Description of the preferred embodiment.