1. Technical Field of the Invention
This invention relates generally to the field of the operation and management of complex systems, including the operation and management of computer networks.
2. Description of Related Background Art
The present invention is intended to facilitate the management of a large-scale, far-flung computer network, such as the extensive distributed systems that are commonplace nowadays in large organizations. The person or team responsible for this job is typically in charge of everything from the organization's power supplies through its business software applications. The organization's business management, naturally, may not wish to concern itself with the technical details, but does demand that when problems occur, they be dealt with according to the seriousness of the effects they have on the normal operations of the business. For example, management will want the greatest attention to be paid to those problems that affect the highest revenue generators among the various parts of the business organization.
This is a difficult demand to meet. For many network operation managers, it can be very hard just managing the network, identifying, diagnosing and correcting problems as they occur. Being able to prioritize among a set of problems occurring during the same time period in such a way as to differentiate among levels of service being provided to different parts of the business organization has thus far been beyond contemplation. One important purpose of the present invention is to make this goal attainable.
The phenomenal complexity of the world of a large distributed network of interrelated components is reflected in the distribution of costs involved in managing such a system. According to one study, about $2.00 of every $10.00 spent on distributed systems engineering and operations is spent on engineering, while the other $8.00 is spent on operations. Moreover, about $6.00 of that $8.00 is spent on problem isolation and diagnosis, while only about $2.00 goes to problem resolution.
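The proportions implied by these figures can be worked out directly. The arithmetic below uses only the dollar amounts quoted from the study and adds no new data:

```latex
\[
\frac{\$6.00}{\$8.00} = 75\% \quad\text{(isolation and diagnosis)}, \qquad
\frac{\$2.00}{\$8.00} = 25\% \quad\text{(resolution)}, \qquad
\frac{\$6.00}{\$2.00} = 3 .
\]
```

In other words, roughly three quarters of operations spending goes to finding the problem, and identification costs about three times as much as resolution.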
If it takes on average three times as long to identify a problem as it does to solve it, the soup of distributed systems parts (hardware and software) and their interrelationships is nearly impenetrable to the operators. This complexity has many sources:
Hardware and software components are heterogeneous. System components are globally distributed. Subcontractors may be running the system, or parts of it, on their own sites, or the business's, or both.
Engineers include multiple redundancies in the design of the system to minimize outages, but each redundancy adds extra complexity to manage. Systems themselves are not self-aware, and cannot report what is wrong with them. At best, individual components can report their states. Component reuse leads to the same components participating in multiple run-time relationships. The "health" of a given component increasingly depends on a contextual, not isolated, evaluation of its state.
A given underlying condition may affect different users in different ways, or to different degrees: one user may be affected seriously, another critically, another benignly or not at all. Problems cascade; locating the eye or center of a storm of phenomena is not easy.
It may even be deemed surprising that only 75% of operations time is spent on identifying problems.
At present, operators are unable to tell how a given problem affects the various users in the business organization, and therefore are unable to know where they should direct enhanced or reduced service efforts, until the problem has been correctly identified. One result of this is that the operations managers have only the other 25% of operations time (the problem resolution portion) from which to carve out all service differentiation.
What is worse, identification of the problem does not necessarily lead clearly to successful resolution of the problem. For example, suppose that the operator has correctly identified the root of a given problem as a bad card in an IP ("Internet Protocol") router. Do any critical business systems depend on that router? Perhaps, or perhaps not.
Continue with the same example. Suppose that the malfunctioning router lies on one leg of a redundant circuit that connects many disparate data delivery functions in a financial services organization. What effect does the fault have on various users?
The network system administrator always needs to know immediately, so that he can go and replace the card.
The manager of a profitable business unit may have invested in redundant circuits, and so experiences no problem.
The manager of a mid-sized unit has co-invested in redundant circuits with another business unit; their joint load on the single remaining circuit permits continued service, but performance deteriorates.
Network engineering has been experimenting with new router cards on their alternate circuit and has rendered that circuit inoperable; they have no service at all.
A market analyst in Brussels receiving critical data from Hong Kong is going to be delayed when she loses all service; she need not have any idea what a router is, or that one exists, but she does need to understand quickly the impact of its disappearance on her work.
A capacity planner needs to know the frequency with which router cards fail, if one brand suffers more failures than another, or if it is necessary to invest in redundant circuits for a group of users whose work is time-sensitive. She does not need to know this instant that some specific router had a bad card.
This single example of a set of failures among computing system components has affected users quite differently. For operations personnel, knowing that the cause of the current set of events was a malfunctioning router card is a start, but provides inadequate understanding for addressing all these needs.
Before the operator can direct problem resolution efforts to a specific part of the business organization, therefore, he or she needs to understand the systemic impact of the problem. Impact is sensitive to a wide system context, and even to conditions of the moment (for instance, the task the Brussels analyst is working on). The operations manager can attempt to deliver differentiated levels of service only when she knows whether and how this particular fault has affected particular groups of users under the conditions of the network at the time of the failure.
It is one object of the present invention to provide a solution to the problem described above. In particular, it is an object to provide the ability to understand the impacts of a given problem on different parts of the organization using the system, at the time the problem occurs, so as to be in a better position to direct problem resolution efforts and problem alleviation efforts intelligently.
Another object of the invention is to provide the ability to model, not only the significant hardware and software resources of the system being administered, but also the service relationships connecting those resources, in a flexible, dynamic manner, so that changes to the construction or make-up of the system being managed can be reflected promptly in the model without the need to restart the model or otherwise to interrupt running the model.
Another object of the invention is to provide a method and system that can associate related events that are of interest to the operators and users of the administered system, and present the results quickly and in a way that makes the information easy to use.
Another object of the invention is to provide a method by which one can flexibly model a system, and in which one can represent, not only the hardware and software resources of the system being modeled, but also arbitrarily-defined groups of those resources.
Still another object of the invention is to provide a method and system in which the operators or a user can define, as needed, a set of data to be obtained relating to the performance of the modeled system, and to provide a particularly convenient way to organize control data to fulfill those requests using agents to obtain the required information.
The preferred embodiment provides a software model of the managed network, and includes a flexible infrastructure for the purpose of obtaining information from the managed network and reporting it as appropriate. In runtime, the data-gathering infrastructure is used to obtain information about what components are present in the network, and about what services each is providing to which other component(s). This information is used to construct the model. In addition, the data-gathering infrastructure obtains from the managed resources information relating to any malfunction or performance degradation, and reports this information to the model, which modifies its state accordingly. The structure of the model itself is used to predict the likely impacts of the reported occurrence, and the occurrence and its predicted impacts are displayed. As all this happens, the data-gathering infrastructure also obtains information concerning the addition of new components to the managed network, the deletion of others, etc., allowing the model to update itself during runtime.
In addition, the system administrators can define elements in the model to represent arbitrary groupings of components, such as business units. As a result, the model predicts impacts not only on individual hardware and software components but also on larger entities that are of significance to the organization using the invention and the managed network.
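By way of illustration only, the kind of model described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the preferred embodiment: the class `ServiceModel`, its method names, the three status labels, and the rule that multiple providers of one consumer are redundant alternatives are all hypothetical simplifications introduced here. The sketch also assumes that the service relationships contain no cycles.

```python
from collections import defaultdict

# Hypothetical severity ordering, used to pick the worst status in a group.
_RANK = {"ok": 0, "degraded": 1, "down": 2}


class ServiceModel:
    """Toy model of resources, service relationships, and arbitrary groups."""

    def __init__(self):
        self.providers = defaultdict(set)  # consumer -> resources serving it
        self.groups = defaultdict(set)     # group name -> member resources
        self.failed = set()                # resources reported as failed

    # The structure can be changed while the model is in use.
    def add_dependency(self, consumer, provider):
        self.providers[consumer].add(provider)

    def remove_dependency(self, consumer, provider):
        self.providers[consumer].discard(provider)

    def add_to_group(self, group, resource):
        self.groups[group].add(resource)

    def report_failure(self, resource):
        self.failed.add(resource)

    def status(self, resource):
        """Predict impact on a resource from the structure of the model.

        Assumption: providers are redundant alternatives, so a consumer is
        'down' only if every provider is down, and 'degraded' if some
        provider is not fully ok. Assumes an acyclic dependency graph.
        """
        if resource in self.failed:
            return "down"
        provs = self.providers.get(resource)
        if not provs:
            return "ok"
        states = [self.status(p) for p in provs]
        if all(s == "down" for s in states):
            return "down"
        if any(s != "ok" for s in states):
            return "degraded"
        return "ok"

    def group_status(self, group):
        """A group is as impaired as its worst-affected member."""
        members = self.groups.get(group, ())
        return max((self.status(r) for r in members),
                   key=_RANK.get, default="ok")


model = ServiceModel()
model.add_dependency("circuit_a", "router_card")
model.add_dependency("unit_1", "circuit_a")   # unit_1 has two circuits
model.add_dependency("unit_1", "circuit_b")
model.add_dependency("unit_2", "circuit_a")   # unit_2 has only one
model.add_to_group("trading", "unit_1")
model.add_to_group("research", "unit_2")

model.report_failure("router_card")
print(model.group_status("trading"))   # degraded
print(model.group_status("research"))  # down
```

The same fault thus yields different predicted impacts for different groups, and a later call such as `model.add_dependency("unit_2", "circuit_b")` changes the prediction for `research` without rebuilding the model.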
The data-gathering infrastructure is conceptually distinct from and independent of the model. In the preferred embodiment, this infrastructure has a number of significant features, including a hierarchical structure that results in the ability to provide the model with as large a stream of data as may be necessary, while limiting the number of interrupts per unit time that the model must tolerate. In addition, this infrastructure preferably has the ability to be given new sets of working instructions during runtime, so that new types of information can be acquired, without the need for restarting the running of the program. Customized inquiries can also be provided in this way. Moreover, the data-gathering infrastructure uses software agents having a structure that makes possible a high degree of reusability, in the form of reusable modules that can be kept in a repository for that purpose.
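Purely as an illustration, the hierarchical arrangement and the runtime replacement of working instructions might be sketched as follows. This is a sketch under stated assumptions rather than the infrastructure of the preferred embodiment: the classes `Model`, `Collector` and `Agent`, the fixed batch size, and the representation of working instructions as a callable `probe` are hypothetical choices made here for brevity.

```python
class Model:
    """Stand-in for the model; counts how often it is interrupted."""

    def __init__(self):
        self.interrupts = 0
        self.events = []

    def deliver(self, batch):
        self.interrupts += 1          # one interrupt per delivered batch
        self.events.extend(batch)


class Collector:
    """Intermediate tier: buffers reports and forwards them in batches."""

    def __init__(self, parent, batch_size):
        self.parent = parent          # a Model or another Collector
        self.batch_size = batch_size
        self.buffer = []

    def deliver(self, batch):
        self.buffer.extend(batch)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.parent.deliver(self.buffer)
            self.buffer = []


class Agent:
    """Leaf tier: applies its current working instructions to a resource."""

    def __init__(self, name, parent, probe):
        self.name = name
        self.parent = parent
        self.probe = probe            # hypothetical: instructions as a callable

    def set_probe(self, probe):
        """Install new working instructions without restarting anything."""
        self.probe = probe

    def poll(self, resource):
        self.parent.deliver([(self.name, self.probe(resource))])


model = Model()
collector = Collector(model, batch_size=4)
agent = Agent("agent_1", collector, probe=lambda r: r["up"])

resource = {"up": True, "load": 0.5}
for _ in range(8):
    agent.poll(resource)
print(model.interrupts, len(model.events))  # 2 8

# Ask for a new kind of information while everything keeps running.
agent.set_probe(lambda r: r["load"])
agent.poll(resource)
collector.flush()
print(model.events[-1])  # ('agent_1', 0.5)
```

Eight reports reach the model in only two deliveries, which is the sense in which an intermediate tier can limit interrupts while still passing along the full stream of data.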
It is to be emphasized that it is by no means necessary to use all these features together; many can be used independently of the others, to great advantage, within the scope of the invention.
The foregoing and other objects, features and advantages of the invention will be more fully appreciated from the following detailed description of the preferred embodiment, taken in conjunction with the accompanying drawings.