Distributed computer systems are widely used to deliver computing and information services to users who access the system over computer networks. The services provided by such systems include, for example, web services, remote terminal services, online shopping, electronic business services, online database access, and enterprise computing and intranet services, amongst many other possibilities.
The overall performance of any distributed computer system may depend upon the performance of a very large number of resources that clients of the system use when accessing the services that it provides. The resources that may influence overall system performance include not only the computing servers providing the main application services of the system, but also the shared network services, communications services, and access systems, such as network switches, routers, and data links, that are essential for providing access to the main application services.
Accordingly, performance of distributed systems may be influenced by numerous factors, including traffic overload in parts of the interconnecting networks, the placement and interconnection of network resources, failures or degradation in the performance of various software and/or hardware components of the system, and the like. The performance issues become increasingly complex and difficult to understand and manage as the system and associated networks become larger and more complex. For example, if an element of the system is not responding to service requests, or is responding slowly, it may be difficult to determine whether the fault is in the element itself, in a data communication link, or in another element of the system, such as an intermediate network device, shared service or memory object that may be affecting the ability of the system element to receive and/or respond to requests.
Network and system management platforms, also referred to as management systems, are intended to assist network and service operators in resolving such issues. Such network management platforms typically operate by collecting information from specified components of a distributed computing system, and making this information available for display and review by the system operator. For example, a management platform typically includes a graphical representation of the managed system. Alerts may be generated to inform the operator that an event has occurred that may require attention. In large systems, many such events may occur simultaneously, and accordingly most management platforms provide alert prioritisation and filtering.
Commercially available management platforms include SPECTRUM from Cabletron Systems, Inc., HP OpenView from Hewlett Packard Corporation, LattisNet from Bay Networks, IBM Netview/6000 from IBM Corporation, and SunNet Manager from SunConnect.
While known management platforms are useful in enabling networks and information systems to be monitored, and sources of possible problems to be identified, there are nonetheless a number of problems associated with their installation and operation. In most cases, known management platforms are designed to collect and monitor a specific set of metrics associated with the managed devices and components of the system. It is often necessary to install additional components, or “agents”, within the elements of the system to collect information about the resources associated with each element. Such platforms are typically based upon an object-oriented architecture that imposes a common object model upon all of the managed resources. This is done in order to provide a consistent interface between the managed elements and the management server and/or management applications that are used to monitor and control the managed resources.
Accordingly, traditional management platforms are limited to the collection and monitoring of a specific set of metrics of the managed resources, and constrained to managing the resources only of those elements within which suitable management agents have been installed. It is therefore not usually possible for the management system to adapt to changes to the architecture of the distributed system, or to monitor components outside the system that is under the control of a system operator, without the installation of further management agents. This can be a significant limitation, since the system performance experienced by an end user may be affected by the performance of shared network services, such as Domain Name Services (DNS), that may be provided by servers that are located outside the control of the operator of a particular information service.
Furthermore, the interpretation of the metrics provided by traditional network management platforms requires expert knowledge of the systems and the metrics involved. Known management platforms do not provide performance metrics that are specific to particular information services, and that are intuitively meaningful to users or non-expert operators of information systems. In many cases, if an alert is generated by an event within the system, it may be difficult to relate the source of the alert to any degradation in system performance that is experienced by end users. Conversely, end users may experience degradation in system performance, resulting in complaints, or calls to a help desk, that may not be readily associated with any specific change in the available metrics, or any alerts that may have been raised.
Accordingly, previous attempts to automate the prediction, detection and correction of causes of performance degradation have been largely unsuccessful, resulting in erroneous outcomes including false identification of problems where no degradation in user performance is experienced, and/or failures to identify causes of performance degradation that is experienced by end users.
Furthermore, when users do report faults or degradation in system performance, there may be a delay between the time at which the performance problems are experienced, and the time at which they are ultimately reported to a system manager. It may therefore be difficult to pinpoint the time at which the performance problems occurred or commenced, and consequently to associate the performance problems with specific events, or changes in the metrics of the managed resources in the system. Accordingly, the correlation of events with changes in system performance is inherently subjective, and the identification of a root cause of such performance problems is also subjective, and therefore dependent upon the skill and expertise of the system manager in interpreting the available information.
Accordingly, there remains a need for methods and apparatus for managing distributed computing systems that are able to mitigate at least one of the aforementioned problems experienced when using currently available management systems.
Any discussion of documents, devices, acts or knowledge in this specification is included to explain the context of the invention. It should not be taken as an admission that any of the material formed part of the prior art base or the common general knowledge in the relevant art on or before the priority date of this application.