1. Field of the Invention
The invention relates to storage subsystem architectures and in particular to a storage controller in a clustered environment that is able to recover from a configuration data mismatch.
2. Discussion of Related Art
Clustered environments are used to improve speed and reliability over that provided by a single server. Clustered environments typically consist of a plurality of physical machines (e.g., nodes), and may additionally consist of a plurality of storage devices, such as hard drives. Reliability may be improved by providing redundant processing and storage of data within the clustered environment. Thus, multiple nodes (e.g., servers) provide reliability on the processing side, while multiple storage devices provide reliability for data storage. All of the devices may be linked together, such that data may be redundantly written to multiple storage devices, while multiple nodes may have access to the redundant data on multiple storage devices. For example, in a Redundant Array of Independent Disks (“RAID”) 1 mirror configuration, two storage devices may store mirror images of the same logical volume. Thus, when one storage device is updated, the other storage device is updated as well.
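The mirroring behavior described above may be illustrated by the following minimal Python sketch. The class name `MirroredVolume`, the in-memory lists standing in for storage devices, and the block-level interface are assumptions made purely for illustration and do not describe any particular RAID implementation.

```python
class MirroredVolume:
    """Toy RAID 1 volume: every write goes to both member devices."""

    def __init__(self, block_count):
        # Two in-memory "devices" (hypothetical stand-ins for physical drives),
        # each holding the same number of blocks.
        self.devices = [[b""] * block_count, [b""] * block_count]

    def write(self, block, data):
        # When one device is updated, the other is updated as well,
        # so both keep a mirror image of the same logical volume.
        for device in self.devices:
            device[block] = data

    def read(self, block, device_index=0):
        # Either copy can serve the read because both hold the same data.
        return self.devices[device_index][block]


volume = MirroredVolume(block_count=4)
volume.write(2, b"payload")
```

After the write, reading block 2 from either device returns the same data, which is the property that allows one storage device to fail without loss of the logical volume.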
There are many different configurations in a clustered environment. One such configuration is an active-active configuration, where two nodes operate at the same time to process two requests in parallel (e.g., the first node processes one request, while the second node processes another request). Both nodes then write data related to the two separate requests to the same set of storage devices.
Another configuration in a clustered environment is an active-passive configuration. In an active-passive configuration, the active node processes requests for the clustered environment and writes data to the redundant storage devices. The passive node waits idle in case the active node is unable to continue processing for any reason (e.g., a power failure or hung condition), and then switches to an active mode and assumes the duties of the active node.
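The role switch in an active-passive configuration may be sketched as follows. The `Node` class, the `healthy` flag standing in for failure detection, and the `failover_if_needed` helper are hypothetical names chosen for illustration; how a real cluster detects a power failure or hung condition is not specified here.

```python
class Node:
    def __init__(self, name, role):
        self.name = name
        self.role = role        # "active", "passive", or "failed"
        self.healthy = True     # assumed to be cleared on power failure or hang

    def handle(self, request):
        # Only a healthy active node processes requests for the cluster.
        if self.role != "active" or not self.healthy:
            raise RuntimeError(f"{self.name} cannot process requests")
        return f"{self.name} processed {request}"


def failover_if_needed(active, passive):
    """Return the node that should serve requests, promoting the passive node if required."""
    if active.healthy:
        return active
    # The passive node switches to active mode and assumes the active node's duties.
    active.role = "failed"
    passive.role = "active"
    return passive


node_a = Node("node_a", "active")
node_b = Node("node_b", "passive")

node_a.healthy = False                      # simulate a hung condition on the active node
serving = failover_if_needed(node_a, node_b)
result = serving.handle("write-request")
```

In this sketch the passive node does no work until the active node is marked unhealthy, at which point it takes over request processing.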
One such duty includes updating configuration data of the clustered environment. All of the storage devices, as well as the nodes, may store configuration data regarding the number of logical drives, the number of physical drives, the RAID level, stripe size, cache policy, etc., of the clustered environment. As storage devices fail, power on or off, or go online or offline, the configuration data of the clustered environment is updated on each of the remaining available storage devices. For example, if power is lost to some of the storage devices, then the storage devices are no longer available. The physical status of the powered off storage devices is changed in the configuration data of the remaining online storage devices by the active node storage controller to reflect the new offline physical status. Once the storage devices are powered back on, the physical status in the configuration data of the storage devices is changed accordingly by the active node storage controller to reflect the availability of the storage devices.
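The configuration update duty may be illustrated by the following sketch, in which each storage device keeps its own copy of a physical status map. The `StorageDevice` class, the device names, and the representation of configuration data as a dictionary are assumptions for illustration only; actual configuration data would also carry fields such as RAID level and stripe size.

```python
from dataclasses import dataclass, field


@dataclass
class StorageDevice:
    name: str
    online: bool = True
    # Each device stores its own copy of the cluster-wide physical status map.
    config: dict = field(default_factory=dict)


def update_configuration(devices):
    """Active-node duty: write the current physical status to every online device."""
    status = {d.name: ("online" if d.online else "offline") for d in devices}
    for d in devices:
        if d.online:
            # Offline devices cannot be written, so their copy is left untouched.
            d.config = dict(status)
    return status


devices = [StorageDevice(n) for n in ("disk0", "disk1", "disk2", "disk3")]
update_configuration(devices)            # all four devices record "online" for every device

devices[2].online = False                # power is lost to two of the devices
devices[3].online = False
update_configuration(devices)            # only disk0 and disk1 receive the new status
```

After the second update, the remaining online devices record `disk2` and `disk3` as offline, while the powered-off devices still hold the older copy that was written before the outage.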
A problem arises when configuration data among the multiple devices becomes mismatched and the active node is unable to resolve the difference. In this situation, the clustered nodes may need to suspend processing until an operator manually reconfigures the configuration data to resolve the mismatch.
Consider, for example, an active-passive configuration in which the active node is responsible for updating configuration data of the clustered environment on the storage devices. Assume that a portion of the storage devices receives power from the same shared power supply as one of the nodes (e.g., the passive node). If a power outage occurs on the shared power supply, then the node, as well as the portion of the storage devices connected to the shared power supply, will no longer be online. The active node (e.g., the node still powered on) will detect the loss of the other node and a portion of the storage devices, and update the configuration data on the remaining online storage devices to reflect that some of the storage devices have failed. When power is restored to the passive node, the offline storage devices are powered up as well. The storage devices are “spun up” and ready for operation well before the passive node completes its power up initialization process. The active node is normally still operational and updates a copy of the configuration data on each recently powered up storage device, as well as the other storage devices, to reflect the new online physical status of the recently powered on storage devices. Thus, when the passive node completes its initialization and compares configuration data on each of the connected storage devices, the configuration data will be consistent among all of the storage devices.
However, if the active node is in a hung condition or otherwise not operable to update configuration data on recently powered up storage devices, then a configuration data mismatch may occur. When the passive node later initializes and compares configuration data on each of the connected storage devices, the configuration data is not consistent, and the passive node is not able to resolve the mismatch. The clustered environment then enters a hung condition, and needs to wait for operator intervention before proceeding with normal operation again. This downtime can be costly in a variety of high-demand applications. For example, when clustered environments are deployed for revenue-sensitive applications, such as telecom billing, downtime can translate directly to lost revenue.
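The shared power supply scenario and the resulting mismatch may be modeled by the following sketch. The function names `find_mismatch` and `simulate`, the four device names, and the dictionary representation of configuration data are hypothetical and serve only to illustrate the comparison the passive node performs when it completes initialization.

```python
def find_mismatch(configs):
    """Return True if the configuration copies differ between storage devices."""
    copies = list(configs.values())
    return any(copy != copies[0] for copy in copies[1:])


def simulate(active_node_operational):
    """Model the power outage and restore; return True if a mismatch remains."""
    online_view = {"disk0": "online", "disk1": "online",
                   "disk2": "online", "disk3": "online"}
    outage_view = {**online_view, "disk2": "offline", "disk3": "offline"}

    # During the outage the active node updates only the devices still online,
    # so disk2 and disk3 keep the copy written before power was lost.
    configs = {
        "disk0": dict(outage_view),
        "disk1": dict(outage_view),
        "disk2": dict(online_view),
        "disk3": dict(online_view),
    }

    # Power is restored and the devices spin up before the passive node finishes booting.
    if active_node_operational:
        # Normal case: the active node rewrites every copy to show all devices online.
        for name in configs:
            configs[name] = dict(online_view)
    # If the active node is hung, nothing rewrites the copies.

    # The passive node completes initialization and compares all copies.
    return find_mismatch(configs)


normal_case_mismatch = simulate(active_node_operational=True)
hung_case_mismatch = simulate(active_node_operational=False)
```

Under these assumptions the normal case yields consistent copies, whereas the hung case leaves two devices reporting an outage that the other two never recorded, which is the condition that currently requires operator intervention.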
It is evident from the above discussion that a need exists for an improved structure and method for recovering from configuration data mismatches in a clustered environment.