1. Field of the Invention
This disclosure relates in general to improving data reliability using multiple node clusters, and more particularly to a method, apparatus and program storage device for providing a triad copy of storage data in multiple node clusters.
2. Description of Related Art
Computer architectures often have a plurality of logical sites that perform various functions. One or more logical sites, for instance, include a processor, memory, input/output devices, and the communication channels that connect them. Information is typically stored in a memory. This information can be accessed by other parts of the system. During normal operations, memory provides instructions and data to the processor, and at other times the memory is the source or destination of data transferred by I/O devices.
Input/output (I/O) devices transfer information between at least one internal component and the external environment without altering the information. I/O devices can be secondary memories, for example disks and tapes, or devices used to communicate directly with users, such as video displays, keyboards, touch screens, etc.
The processor executes a program by performing arithmetic and logical operations on data. Modern high performance systems, such as vector processors and parallel processors, often have more than one processor. Systems with only one processor are serial processors, or, especially among computational scientists, scalar processors. The communication channels that tie the system together can either be simple links that connect two devices or more complex switches that interconnect several components and allow any two of them to communicate at a given point in time.
A parallel computer is a collection of processors that cooperate and communicate to solve large problems quickly. Parallel computer architectures extend traditional computer architecture with a communication architecture, and provide abstractions at the hardware/software interface together with an organizational structure that realizes those abstractions efficiently. Parallel computing involves the simultaneous execution of the same task (split up and specially adapted) on multiple processors in order to obtain faster results.
There currently exist several hardware implementations for parallel computing systems, including but not necessarily limited to a shared-memory approach and a shared-disk approach. In the shared-memory approach, processors are connected to common memory resources. All inter-processor communication can be achieved through the use of shared memory. This is one of the most common architectures used by systems vendors. However, memory bus bandwidth can limit the scalability of systems with this type of architecture.
In a shared-disk approach, processors have their own local memory, but are connected to common disk storage resources; inter-processor communication is achieved through the use of messages and file lock synchronization. However, I/O channel bandwidth can limit the scalability of systems with this type of architecture.
A computer cluster is a group of connected computers that work together as a parallel computer. All cluster implementations attempt to eliminate single points of failure. Moreover, clustering is used for parallel processing, load balancing and fault tolerance, and is a popular strategy for implementing parallel processing applications because it enables companies to leverage the investment already made in PCs and workstations. In addition, it is relatively easy to add new CPUs simply by adding a new PC to the network. A “clustered” computer system can thus be defined as a collection of computer resources having some redundant elements. These redundant elements provide flexibility for load balancing among the elements, or for failover from one element to another, should one of the elements fail. From the viewpoint of users outside the cluster, these load-balancing or failover operations are ideally transparent. For example, a mail server associated with a given Local Area Network (LAN) might be implemented as a cluster, with several mail servers coupled together to provide uninterrupted mail service by utilizing redundant computing resources to handle load variations or server failures.
Within a cluster, the likelihood of a node failure increases with the number of nodes. Furthermore, a number of different types of failures can result in the failure of a single node. Examples include a processor failure at a node, a failure of a non-volatile storage device or of the controller for such a device at a node, a software crash at a node, or a communication failure that results in all other nodes losing communication with a node. In order to provide high availability (i.e., continued operation) even in the presence of a node failure, information is commonly replicated at more than one node. For example, a storage server can be viewed as a specialized parallel computer that is optimized to accept requests from clients who want to read or write data. The specialized parallel computer can be thought of as two nodes, or controllers, closely coupled, each connected to clients or a SAN. The two nodes communicate over a communication network and can mirror write data, check whether requests are cached, and use each other as a failover partner when serious errors occur. Thus, in the event of a failure of one node, the information stored at the failed node can be obtained from the node that has not failed.
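The two-node mirroring and failover behavior described above can be illustrated with a minimal sketch. The class names `Node` and `TwoNodeServer`, the `alive` flag, and the dictionary used as a write cache are hypothetical and chosen only for illustration; they are not taken from any particular storage server.

```python
class Node:
    """One controller with a (conceptually battery-backed) write cache."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.cache = {}  # block address -> data (illustrative stand-in)


class TwoNodeServer:
    """Hypothetical two-controller storage server that mirrors writes."""

    def __init__(self):
        self.nodes = [Node("A"), Node("B")]

    def write(self, block, data):
        """Copy the write to every surviving node; return the copy count."""
        survivors = [n for n in self.nodes if n.alive]
        if not survivors:
            raise RuntimeError("no surviving node: outage")
        for node in survivors:
            node.cache[block] = data
        return len(survivors)

    def read(self, block):
        """Serve the read from the first surviving node holding the block."""
        for node in self.nodes:
            if node.alive and block in node.cache:
                return node.cache[block]
        raise KeyError(block)


server = TwoNodeServer()
copies_before = server.write(7, b"payload")  # both nodes hold block 7
server.nodes[0].alive = False                # simulate failure of node A
recovered = server.read(7)                   # node B still has the data
copies_after = server.write(8, b"more")      # only one copy can be made now
```

In this sketch a write made while both nodes are running is held twice, so it survives the loss of either node, whereas a write made after one node has failed is held only once.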
It is common for each controller to handle even or odd logical unit numbers (LUNs) and/or even or odd count key data (CKD) volumes. When a customer writes a sector or block in the storage system, the storage system will make a copy on both nodes. These nodes may be battery backed up by some mechanism, so the data is protected from a power outage and/or a failure of one of the nodes. However, if a user needs to update firmware on one controller of a two-node system, only one node is left running, and an outage results if that node fails. Similarly, if one node of a two-node system fails, the system is reduced to single-node operation, and the risk of an outage increases because no failover partner remains for the node left running.
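As a rough illustration of the even/odd assignment and of the exposure during single-node operation, consider the following sketch. The function names `owner_for_lun` and `tolerable_failures`, the controller indices 0 and 1, and the rule that the partner takes over when the preferred controller is offline are assumptions made for this example only.

```python
def owner_for_lun(lun, online):
    """Return the index of the controller that handles `lun`.

    Assumption: controller 0 prefers even LUNs and controller 1 prefers
    odd LUNs; if the preferred controller is offline, its partner takes over.
    `online` is the set of controller indices currently running.
    """
    preferred = lun % 2
    if preferred in online:
        return preferred
    partner = 1 - preferred
    if partner in online:
        return partner
    raise RuntimeError("no controller online: outage")


def tolerable_failures(online):
    """Number of further node failures that can occur before an outage."""
    return max(len(online) - 1, 0)


both_running = {0, 1}
firmware_update = {0}  # controller 1 taken offline for a firmware update

normal_owner = owner_for_lun(5, both_running)       # odd LUN, controller 1
takeover_owner = owner_for_lun(5, firmware_update)  # partner takes over
margin_normal = tolerable_failures(both_running)
margin_update = tolerable_failures(firmware_update)
```

With both controllers running, one further failure can be tolerated; with one controller offline for a firmware update, none can, which is the exposure described above.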
A “pure” or symmetric cluster application architecture uses a cluster model in which every node is homogeneous and there is no static or dynamic partitioning of the application resource or data space. In other words, every node can process any request from a client of the clustered application. This architecture, along with a load balancing feature, has intrinsic fast-recovery characteristics because application recovery is bounded only by cluster recovery, with implied recovery of locks held by the failed node. Although symmetric cluster application architectures have good characteristics, they involve distributed lock management requirements that can increase the complexity of the solution and can also affect the scalability of the architecture.
It can be seen that there is a need for a method, apparatus and program storage device for extending node clusters in order to increase data reliability within a storage server environment.