When information technology professionals design systems, they face some very tough challenges. Users are no longer naive, and have high expectations about the reliability, availability and serviceability of the systems they use. Increasing competitive pressure on enterprises has forced other attitudes to change as well. Information is fast becoming many businesses' greatest asset, making the need to guarantee higher degrees of availability more important than ever.
Traditionally, the obstacle to accomplishing this has been cost. However, enterprises are finding ways of reducing the cost of ownership and operation of computer systems within their business environments. One such approach is the use of networked computers. Networked computers enable enterprises to provide shared resources to multiple users while ensuring that system downtime is substantially reduced.
In a networked computing environment, a group of computer systems (or nodes) may be used to form a cluster. Clusters are characterized by multiple systems, or nodes, that work together as a single entity to cooperatively provide applications, system resources, and data to the user community. Each node can itself be a symmetric multiprocessing system (SMP) containing multiple CPUs.
Clusters provide levels of performance and availability not attainable in single SMP systems. Performance is scalable across nodes, offering a high-end growth path: additional nodes, processors, memory and storage can be added as they are required. Clusters also provide increased availability: should one node fail, other cluster nodes can continue to provide data services, and the failed node's workload can be spread across the remaining members.
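By way of illustration only, the spreading of a failed node's workload across the remaining members may be sketched as follows. The function name `redistribute`, the service names, and the least-loaded placement rule are assumptions made for this example and are not taken from the system described here; the node labels merely echo the reference numerals of FIG. 1.

```python
def redistribute(assignments, failed_node):
    """Return a new node -> services mapping with the failed node's
    services spread across the surviving nodes.

    Hypothetical placement rule: each orphaned service goes to the
    survivor that currently hosts the fewest services (ties broken by
    node name).
    """
    survivors = sorted(n for n in assignments if n != failed_node)
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the workload")

    # Copy the survivors' existing services so the input is not mutated.
    new_assignments = {n: list(assignments[n]) for n in survivors}

    for service in assignments.get(failed_node, []):
        target = min(survivors, key=lambda n: (len(new_assignments[n]), n))
        new_assignments[target].append(service)
    return new_assignments


# Illustrative data only.
before = {
    "110A": ["db"],
    "110B": ["web", "mail"],
    "110C": [],
    "110D": ["nfs", "ldap"],
}
after = redistribute(before, "110D")
```

In this sketch the two services of the failed node land on the two least-loaded survivors, so no single remaining node absorbs the whole workload.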
Clustered architectures are uniquely suited for the provision of highly available services. In a properly designed arrangement, they feature redundant paths between all systems, between all disk sub-systems and to all external networks. No single point of failure—hardware, software or network—can bring a cluster down. Fully integrated fault management software in the cluster detects failures and manages the recovery process automatically.
FIG. 1 is a simplified exemplary illustration of a prior art networked (cluster) computer system. The cluster 100 of FIG. 1 integrates redundant server systems, or nodes 110 A-D, to ensure high availability and scalability. The cluster 100 includes redundant storage disks 120 A-C, which are generally mirrored to permit uninterrupted operation in the event that one of them fails.
Redundant connections are provided to the disk systems 120 A-C so that data is not isolated in the event of a node, controller or cable failure. Redundant connections between nodes are also provided via private interconnects 130 A and 130 B to enable the nodes to stay synchronized and work together as a cluster. All cluster nodes are connected to one or more public networks, such as a public switched telephone network (PSTN) 140, enabling clients on multiple networks to access data. Because most clusters are intended to be managed as a single computing resource, cluster management software may provide unified control of the cluster.
The software simplifies administration by treating all servers in the cluster as a single entity. FIG. 2 is an exemplary block diagram illustration of the prior art software architecture 200 of the cluster 100 in FIG. 1. Although each of the nodes in cluster 100 may have its own independent software architecture, the cluster 100 is typically designed with a unified software architecture to support both database applications and other highly available data services. The software architecture in FIG. 2 includes a cluster framework layer 210, a data services layer 220, an application program interface layer 230, an operating system layer 240 and a platform specific services layer 250.
The cluster framework layer 210 is at the heart of the overall software architecture and supports multiple data services, while the application program interface layer 230 enables developers to integrate additional customer applications into the cluster architecture and make them available. The data services layer 220 and the cluster framework layer 210 are based on an underlying operating system layer 240 that provides programming interfaces and full support for multiprocessing and multi-threaded applications. Although the cluster software architecture 200 provides a unified architecture of support across nodes in the cluster, each cluster node may have different versions of software at any given time. In order to provide the services in the cluster 100, some of the required protocols need to be distributed; that is, the software runs on multiple nodes in the cluster 100 simultaneously and not merely in one place at one time. For example, the cluster membership monitor uses a distributed protocol to determine which nodes are part of the cluster 100 and which nodes are not at any given time.
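The idea of a membership decision that depends on information held by several nodes at once may be sketched as follows. This is a simplified illustration under stated assumptions, not the protocol used by any actual cluster membership monitor: the function `agreed_membership`, the per-node "view" sets, and the strict-majority rule are all hypothetical.

```python
def agreed_membership(views):
    """Derive a cluster membership from per-node reachability views.

    views maps each configured node to the set of nodes it currently
    hears from (a node always hears itself). Hypothetical rule: a node
    is treated as a member only if a strict majority of all configured
    nodes hear from it.
    """
    all_nodes = set(views)
    quorum = len(all_nodes) // 2 + 1

    members = set()
    for candidate in all_nodes:
        votes = sum(1 for seen in views.values() if candidate in seen)
        if votes >= quorum:
            members.add(candidate)
    return members


# Illustrative data only: node 110D has lost contact with its peers.
views = {
    "110A": {"110A", "110B", "110C"},
    "110B": {"110A", "110B", "110C"},
    "110C": {"110A", "110B", "110C"},
    "110D": {"110D"},
}
members = agreed_membership(views)
```

The sketch shows why such a protocol must run on every node: no single node's view is sufficient to decide membership, and every node must interpret the exchanged information in the same way.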
As the versions of distributed protocols change over time, incompatibility of software versions among nodes impedes communication between the nodes and hampers the ability of the cluster to provide service availability to users. Furthermore, the prior art system illustrated in FIG. 2 lacks the ability to automatically and simultaneously upgrade software versions across the cluster. Consequently, software upgrades in the cluster can be very cumbersome, time consuming and costly, impacting system and cluster resource availability to end users.
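The incompatibility problem can be made concrete with a small sketch. It assumes, purely for illustration, that each node supports a contiguous range of integer protocol versions and that the nodes can interoperate only if at least one version is common to all of them; the function `common_version` and the version numbers are hypothetical.

```python
def common_version(supported):
    """Return the highest protocol version supported by every node,
    or None if the nodes share no version.

    supported maps a node name to an inclusive (min_version,
    max_version) pair of integers.
    """
    if not supported:
        return None
    lowest_allowed = max(lo for lo, _ in supported.values())
    highest_allowed = min(hi for _, hi in supported.values())
    if lowest_allowed > highest_allowed:
        return None  # the ranges do not overlap
    return highest_allowed


# Illustrative data only.
compatible = {"110A": (1, 3), "110B": (2, 3), "110C": (2, 4)}
mixed = dict(compatible, **{"110D": (4, 5)})

shared = common_version(compatible)
shared_after_upgrade = common_version(mixed)
```

In the first configuration all three nodes share version 3. Once a fourth node running only versions 4 and 5 joins, no version is common to all nodes, which is the situation in which communication between nodes is impeded.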
Thus, there is a need for a system and method for automatically managing software version changes in a cluster system to ensure system availability and to reduce the cost associated with system upgrades. A solution that provides self-consistency and avoids complex or error-prone version management mechanisms is also needed.