Distributed computing systems have found application in a number of different computing environments, particularly those requiring high performance and/or high availability and fault tolerance. In a distributed computing system, multiple computers connected by a network are permitted to communicate and/or share workload. Distributed computing systems support practically all types of computing models, including peer-to-peer and client-server computing.
One particular type of distributed computing system is referred to as a clustered computing system. “Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a client or user, the nodes in a cluster appear collectively as a single computer, or entity. In a client-server computing model, for example, the nodes of a cluster collectively appear as a single server to any clients that attempt to access the cluster.
Clustering is often used in relatively large multi-user computing systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Often, load balancing can also be used to ensure that tasks are distributed fairly among nodes, preventing individual nodes from becoming overloaded and thereby maximizing overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
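By way of illustration only, the load balancing mentioned above might be sketched as a least-loaded assignment, in which each incoming task goes to whichever node currently carries the smallest accumulated load. The function name `assign_tasks`, the node names, and the use of a numeric cost per task are assumptions made for this sketch; they do not describe any particular clustering product.

```python
import heapq
from typing import Dict, List, Tuple


def assign_tasks(nodes: List[str], task_costs: List[int]) -> Dict[str, List[int]]:
    """Assign each task to the node with the smallest accumulated load.

    Hypothetical helper: 'nodes' are node names and 'task_costs' are
    illustrative numeric costs, one per task, in arrival order.
    """
    # Min-heap of (accumulated load, node name); ties break on node name.
    heap: List[Tuple[int, str]] = [(0, name) for name in nodes]
    heapq.heapify(heap)
    assignment: Dict[str, List[int]] = {name: [] for name in nodes}
    for cost in task_costs:
        load, name = heapq.heappop(heap)  # least-loaded node so far
        assignment[name].append(cost)
        heapq.heappush(heap, (load + cost, name))
    return assignment
```

A greedy policy of this kind does not guarantee an optimal distribution, but it keeps any single node from accumulating work while other nodes sit idle.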
In many clustered computer systems, the services offered by such systems are implemented as managed resources. Some services, for example, may be singleton services, which are handled at any given time by one particular node, with automatic failover used to move a service to another node whenever the node currently hosting the service encounters a problem. Other services, often referred to as distributed services, enable multiple nodes to provide a service, e.g., to handle requests for a particular type of service from multiple clients.
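The distinction between singleton and distributed services can be sketched, under simplifying assumptions, as follows. The class names `Node`, `SingletonService`, and `DistributedService`, the boolean `healthy` flag, and the first-healthy-candidate failover policy are all hypothetical and chosen only for illustration; actual clusters typically rely on heartbeats, quorum, and more elaborate placement rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool = True  # assumed to be maintained by some health monitor


@dataclass
class SingletonService:
    """A service hosted by exactly one node at any given time."""
    name: str
    host: Node

    def failover(self, candidates: List[Node]) -> Node:
        """Keep the current host if healthy; otherwise move to the
        first healthy candidate and return the new host."""
        if self.host.healthy:
            return self.host
        for node in candidates:
            if node.healthy:
                self.host = node
                return node
        raise RuntimeError(f"no healthy node available for {self.name}")


@dataclass
class DistributedService:
    """A service that several nodes may provide concurrently."""
    name: str
    members: List[Node] = field(default_factory=list)

    def available_hosts(self) -> List[Node]:
        # Any healthy member may handle client requests for this service.
        return [n for n in self.members if n.healthy]
```

In this sketch a singleton service always has one host and changes it only on failure, whereas a distributed service simply exposes whichever member nodes are currently healthy.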
As distributed computing systems become more complex, administration of such systems can become difficult and time-consuming. Distributed computing systems are increasingly called upon to deliver a greater number and wider variety of services, including services provided by a multitude of vendors. Furthermore, the underlying hardware systems incorporated into a distributed computing system may be heterogeneous in nature, including systems of varying capabilities and design, and provided by different vendors.
Traditionally, software and hardware provided by different vendors, and even many components provided by the same vendor, have been managed individually, e.g., using individual management programs, also known as consoles, running locally on specific systems that provide specific services in the distributed computing environment. Dedicated programs for managing different services, however, often rely on different user interfaces and command structures, requiring system administrators to be proficient in multiple dedicated programs. Furthermore, given that distributed computing systems may be geographically dispersed, often a system administrator will need to be on-site in order to effectively manage some of the components in a distributed computing system.
To simplify the management of complex distributed computing systems, efforts have been made to standardize administration activities, as well as to provide remote management programs that enable computing systems to be managed remotely. It would be highly desirable for an enterprise's system administrators if all of the services and hardware in the enterprise's distributed computing systems could be managed through only a few management consoles. In fact, the most desirable situation would be one in which a distributed computing system could be managed through a single management console. Unfortunately, however, due to the heterogeneous and distributed nature of most distributed computing environments, this goal can rarely be achieved.
One specific problem that arises in many distributed computing environments relates to the detection, reporting, diagnosis and remediation of errors or error conditions that occur in such environments. Particularly in distributed computing environments where services are dispersed across multiple hardware platforms, and involve the interaction of multiple systems, simply isolating an error to a specific system can be problematic. Even if some form of integrated management console is available to manage multiple systems, and even if an error condition is successfully routed to an integrated management console, a system administrator often will still be required to manually “poke around” (i.e., directly access, interrogate, examine, reconfigure, etc.) one or more individual systems in order to effectively isolate and rectify an error condition.
Efforts have also been directed toward making computer systems more autonomic, i.e., to incorporate self-optimizing, self-protecting, self-configuring and/or self-healing capabilities. Autonomic capabilities often lead to more reliable computer systems because potential problems can often be addressed proactively, and often without manual intervention by a system administrator. In many instances, problems may be addressed before a computer system experiences a failure that interrupts any services provided by that computer system.
To the extent that autonomic features have been incorporated into computer systems, however, such features are often tied to specific architectures and limited failure scenarios. As such, conventional autonomic systems have not found widespread acceptance in heterogeneous distributed computing environments that integrate a wide variety of computer systems and provide a variety of services. Therefore, a significant need continues to exist in the art for an improved manner of facilitating the reporting and diagnosis of errors in distributed computing environments.