A resource management and analysis system is typically used to manage (e.g., monitor and control) the operation of ever-larger networked systems and networks of networked systems. A distributed system (e.g., a computer or communication system) generally includes many individual components (e.g., nodes or devices), which may be implemented using both hardware and software elements. The individual devices, and the relationships between them, conventionally define the “topology” of a distributed system or of a similar resource, e.g., a distributed application.
A resource management system typically includes a plurality of agents that are assigned to a centralized manager. The agents of the resource management system are used to monitor, control, and otherwise influence the behavior of the devices or elements of the managed distributed system. These agents may be any suitable software or hardware element that is capable of collecting information, e.g., statistics, about the behavior of a device and/or enacting required changes to the device. Moreover, any number of the components in a distributed system may be associated with one or more agents, although each component for which monitoring and/or control is desired must be associated with at least one agent.
A centralized manager coordinates the operation of the agents in the resource management system. As is the case with agents, the centralized manager may be any suitable software or hardware element, although it must be capable of performing tasks required (or useful) to monitor or control a distributed system, such as analysis (performance or fault), configuration changes, etc. In many types of resource management systems, the agents run on, or in the same network as, the respective network devices they are monitoring and/or controlling, while the manager remotely collects information from one or more agents to perform its task as a whole.
It is important to note that the agents are not required to be on the same network as the managed device or on the device itself. The distinction between the manager and the agent is in their functionality (e.g., monitoring, control, or analysis) rather than their location relative to the devices.
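By way of illustration only, the division of functionality between agents and a centralized manager may be sketched as follows. The class names `Agent` and `Manager`, the method names, and the `cpu` statistic are hypothetical and are not drawn from any particular resource management system; the sketch assumes each agent simply returns stored statistics rather than querying a real device.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Agent:
    """Hypothetical agent that collects statistics for one managed device."""
    device_id: str
    stats: Dict[str, float] = field(default_factory=dict)

    def collect(self) -> Dict[str, float]:
        # A real agent would query the device; this sketch returns stored values.
        return dict(self.stats)


@dataclass
class Manager:
    """Hypothetical centralized manager that coordinates a set of agents."""
    agents: List[Agent] = field(default_factory=list)

    def poll(self) -> Dict[str, Dict[str, float]]:
        # Gather the latest statistics from every assigned agent.
        return {agent.device_id: agent.collect() for agent in self.agents}

    def devices_over(self, metric: str, threshold: float) -> List[str]:
        # Simple analysis step: list devices whose metric exceeds the threshold.
        return sorted(
            device
            for device, stats in self.poll().items()
            if stats.get(metric, 0.0) > threshold
        )


manager = Manager([Agent("r1", {"cpu": 0.9}), Agent("r2", {"cpu": 0.2})])
overloaded = manager.devices_over("cpu", 0.8)
```

In this sketch the agent only collects and the manager only analyzes, consistent with the distinction being one of functionality rather than of location.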
The resource management and analysis system may, in receiving information from the agents, perform an analysis of the distributed system. For example, the agents may provide indicators of events occurring or detected in a corresponding network element to the resource management system, and the resource management system may utilize this information to perform an analysis of the health and/or status of the network. A method and system that may be used to perform such an analysis is described in commonly-owned U.S. Pat. Nos. 5,528,516; 5,661,668; 6,249,755; 6,868,367; and 7,003,433, the contents of which are incorporated by reference herein. The aforementioned U.S. patents teach performing a system analysis based on a mapping of observable events and detectable events, e.g., symptoms and problems, respectively, to determine the cause of the detected events or indicators being generated. Impact analysis is a similar analysis that may be performed based on the information provided by the agents. In one aspect, a measure may be determined based on a difference between values that represent the relationship or correlation between a symptom and a problem, i.e., the possibility of the symptom being caused by the problem. The measure may be a Hamming distance.
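As a minimal sketch of how a Hamming distance may serve as such a measure, the following example compares an observed symptom pattern against the pattern expected for each candidate problem and selects the closest match. The mapping contents, the problem names, and the function names are hypothetical, and the sketch assumes binary values (symptom expected or not expected); it is not presented as the method of the patents referred to above.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical mapping: each problem is associated with the symptom pattern
# it is expected to cause (1 = symptom expected, 0 = symptom not expected).
CODEBOOK: Dict[str, Tuple[int, ...]] = {
    "link_down": (1, 1, 0, 0),
    "router_fail": (1, 1, 1, 0),
    "server_overload": (0, 0, 1, 1),
}


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))


def most_likely_problem(
    observed: Sequence[int],
    codebook: Dict[str, Tuple[int, ...]] = CODEBOOK,
) -> str:
    """Return the problem whose expected pattern is closest to the observation."""
    return min(codebook, key=lambda p: hamming_distance(codebook[p], observed))


# One symptom is missing relative to the "router_fail" pattern, which is
# still the nearest entry (distance 1, versus 2 for the other problems).
diagnosis = most_likely_problem((1, 0, 1, 0))
```

Selecting the nearest pattern rather than requiring an exact match allows a cause to be identified even when an indicator is lost or a spurious one is reported.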
A limitation on the performance of resource management systems has traditionally been the size of the network or the system being managed. Large systems that have components or elements distributed over a wide geographic area can impose an unsustainable computational burden on the resource management system. One approach often used to alleviate the burden on the resource management system of a distributed system, and thus to improve scalability, is to create a distributed-architecture management system. In a distributed-architecture management system, a single, centralized manager is replaced by a plurality of managers, each of which oversees a subset of the agents in the distributed system, network or resource. Each manager is associated with a respective partition or subset of the distributed-architecture management system.
One method proposed for distributing the agents is described in commonly-owned U.S. patent application Ser. No. 11/952,395, entitled “Method and Apparatus for Arranging Distributed System Topology Among a Plurality of Network Managers,” filed on Feb. 7, 2005, the contents of which are incorporated by reference herein. As is described, the network is subdivided into an initial set of groups of managed elements. The subsequent formulation of the groups, and the associated agents, is determined in accordance with an iterative process. The process limits the number of managed entities for each agent to prevent the overburdening of any one agent in performing its management and analysis functions.
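The general idea of iteratively reforming groups so that no group exceeds a size limit may be illustrated by the following simplified sketch. It is not the process of the referenced application: the function name `split_oversized_groups` and the halving rule are assumptions made for illustration, and the sketch ignores the relationships between managed elements that an actual grouping process would take into account.

```python
from typing import List


def split_oversized_groups(groups: List[List[str]], max_size: int) -> List[List[str]]:
    """Repeatedly halve any group that holds more than max_size elements."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    pending = [list(group) for group in groups]  # initial set of groups
    result: List[List[str]] = []
    while pending:
        group = pending.pop()
        if len(group) <= max_size:
            result.append(group)  # within the limit, keep as-is
        else:
            mid = len(group) // 2
            pending.append(group[:mid])  # re-examine both halves
            pending.append(group[mid:])
    return result


initial = [["a", "b", "c", "d", "e"], ["f"]]
balanced = split_oversized_groups(initial, max_size=2)
```

Every managed element remains in exactly one group, and the loop terminates because each split strictly reduces the size of the groups still pending.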
However, such distribution of the management and analysis function into a plurality of agents requires further coordination of the information provided by each agent. This coordination requires additional processing capability to understand the relationships between the different management agents, and the coordination must be altered to accommodate the introduction of additional management agents and their relationships.
Hence, there is a need in the industry for a resource management and analysis system that provides for scalability of the resource management system capabilities without requiring a proportional increase in, or burdening of, the underlying elements of the resource management system.