The present invention relates generally to the field of array storage devices in computer processing systems and networks. More specifically, the present invention relates to a controller fault recovery system for recovering from faults that cause unscheduled stops for a distributed file system operating on an array storage system having multiple controllers. The system provides a proxy arrangement to protect data integrity in the event of an unscheduled stop on just one controller in the array storage system, and an atomic data/parity update arrangement to protect data integrity in the event of an unscheduled stop of more than one controller.
The use of array storage systems to store data in computer processing systems and networks is well known. Five different classes of architectures of array storage systems are described under the acronym “RAID” (Redundant Array of Independent/Inexpensive Disks). The primary purpose of a RAID system is to detect corrupted data, and, depending upon the class of the RAID system and the extent of the data corruption, use the data stored on other drives in the disk array to correct the corrupted data. In a RAID 5 disk array, for example, a parity technique is used that protects against the failure of a single one of the disk drives that make up the disk array. Data is written to the disk array in multiple data blocks with a defined number of data blocks (N) making up a parity group. Each parity group is protected by a parity block which is also written to the disk array. The parity block is generated by an exclusive or (XOR) operation on all of the data blocks in the parity group. When the parity group is read, the XOR operation is performed on the data blocks for the parity group and the results are compared with the results stored in the parity block to detect potential corrupted data. In a RAID 5 disk array, each of the data blocks in a parity group, as well as the parity block, is stored on a different disk drive. Therefore, there are a minimum of N+1 disk drives in the RAID 5 array for a parity group having N data blocks. In other words, for a disk array having N disk drives, there can be only N−1 data blocks in a parity group.
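The parity technique described above can be illustrated with a minimal Python sketch. The block contents, the block size, and the helper name `xor_blocks` are assumptions chosen for illustration and do not come from any particular RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A parity group with N = 3 data blocks (illustrative contents).
data_blocks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data_blocks)

# Detection: recompute the XOR on read and compare with the stored parity block.
parity_ok = xor_blocks(data_blocks) == parity

# Correction: if the drive holding block 1 is lost, XOR the surviving
# data blocks with the parity block to regenerate the missing block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, the same routine serves for generating parity, checking it, and regenerating any single missing block.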
Due to the large amount of processing that can be required to implement error detection or error correction techniques, most existing array storage systems are implemented as a set of disks uniquely attached to and managed by a specialized hardware disk controller. In addition to the normal buffers and input/output circuitry required for passing data between a requestor and the disk array, the specialized disk controllers will typically have additional hardware, such as XOR hardware circuitry for computing parity and nonvolatile RAM (NVRAM) for caching and logging. These types of array storage systems are often referred to as hardware RAID systems.
The use of error detection and error correction techniques in RAID systems was initially thought to provide a sufficient level of reliability for these storage systems. Unfortunately, even RAID systems are susceptible to corruption in the event of an unscheduled stop due to a hardware or software error if it occurs during the period when updates or modifications are being made to either a data block or a parity block. Because there is a relatively high correlation of unscheduled stops with hardware faults that may prevent all of the drives in a disk array from being accessed, after such an unscheduled stop, it is often necessary for the system to reconstruct data on a lost drive. For example, if one drive out of a four drive RAID 5 disk array is inaccessible after an unscheduled stop, it will be necessary to reconstruct all of the information on that lost drive using the data and parity on the remaining three drives. If the unscheduled stop occurs during a period when updates or modifications are being made, the problem is deciding what are the proper contents of the blocks in the parity group on the remaining three drives that should be used to reconstruct the data. For example, if a new data block was written before a crash, but the corresponding new parity block was not, then the recovered data would be inaccurate if the information on the lost drive were to be reconstructed with the new data block contents and the old parity block contents.
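The inconsistency described in the example above can be reproduced in a few lines of Python. This is only a sketch of a four-drive RAID 5 parity group (three data blocks plus parity); the block values and the name `xor_blocks` are illustrative assumptions.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# State before the update: three data blocks and their parity.
d0_old, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity_old = xor_blocks(d0_old, d1, d2)

# An update replaces d0; the matching parity would be parity_new.
d0_new = b"ZZZZ"
parity_new = xor_blocks(d0_new, d1, d2)

# Unscheduled stop: d0_new reached disk, parity_new did not, and the
# drive holding d1 is inaccessible afterwards.
stale = xor_blocks(d0_new, d2, parity_old)        # new data + old parity
consistent = xor_blocks(d0_new, d2, parity_new)   # new data + new parity
```

Here `stale` differs from the lost block `d1`, while `consistent` reproduces it, which is why recovery must know which of the two writes actually completed.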
The problem of knowing which data/parity blocks were successfully written and which were not is compounded by the fact that the buffers of the controller store data in volatile RAM during the relatively long period (of unpredictable length) required to actually write the data to the disk array. In the event of an unscheduled stop during this period, the data may or may not have been written to the disk array, and the contents of the volatile RAM are lost. Many hardware RAID systems choose to ignore this problem and reconstruct the data from whatever contents of the data and parity blocks are present on the remaining drives. Other systems recognize the problem and attempt to reconstruct a version of the lost data using a predetermined data pattern as described in U.S. Pat. No. 5,933,592. Some hardware RAID systems solve this problem by performing actions in order and using the NVRAM to maintain a log of the actions as they are performed. If an unscheduled stop or error occurs, then the controller plays back the log in the NVRAM and attempts to restore the data to a known state. An example of this type of error recovery using non-volatile storage in the controller is described in U.S. Pat. No. 6,021,463. Other examples include the Enterprise Storage system from Network Appliance that uses the WAFL file system and a non-volatile RAM to keep a log of all requests processed since the last consistency point (i.e., every 10 seconds or so a snapshot of the file system is stored in the NVRAM), and the SPRITE file system from the University of California, Berkeley that used a non-volatile recovery box to store critical system information.
While hardware RAID systems can be effective in solving many of the problems of array storage systems, such systems can be expensive, complicated and less efficient than non-RAID storage systems. The use of NVRAM to periodically store file system information does allow for recovery at each consistency point, but does nothing to avoid or minimize the loss of data resulting from errors occurring between such consistency points. Additionally, the ability to scale hardware RAID systems or use hardware RAID systems in a larger network environment can be limited. In an effort to decrease the cost and complexity of hardware RAID systems, software implementations of RAID systems have been developed for use with disk arrays that have relatively simple hardware controllers.
Most software RAID systems are implemented as part of a centralized file system that governs how the data is to be stored. However, such software RAID systems are subject to the same problems as the hardware RAID systems described above. Software RAID systems can be designed to recover from an unscheduled stop when all of the disks are available after the unscheduled stop by scanning all of the data on the system and performing error detection or parity checks to verify the accuracy of the data. Unfortunately, the amount of time required for this kind of recovery can be very lengthy. To solve the problem of having to scan all of the data for accuracy in the event of an unscheduled stop or error, some software RAID systems use a bit map stored along with the control information or meta-data for a file to indicate whether the parity for the data blocks that make up that file is accurate. Examples of the use of such parity bit maps are described in U.S. Pat. Nos. 5,574,882 and 5,826,001.
Though parity bit maps can be effective in decreasing the time required for recovery, parity bit maps do not address the problem of whether data in the buffers of the controller was successfully written to the disk array. One solution to this problem is described in U.S. Pat. No. 6,041,423 that uses undo/redo logging to perform asynchronous updates of parity and data in an array storage system. This patent uses a centralized file system that maintains a version number for each page or block of data or parity. Software in the controller creates a log of changes between versions of each page. Updates to parity are made asynchronous to writing data to the storage array and are preferably deferred and accumulated in order to improve performance. Other examples of logging for centralized file systems include the Veritas File System which attempts to enable constant file system data integrity by maintaining information on incomplete metadata changes in an intent log, the XFS file system from Silicon Graphics which provides asynchronous metadata logging to ensure rapid crash recovery, and the Structured File Server for the AFS file system from IBM/Transarc which provides a metadata manager that uses B-tree clustered file records and mirrors all metadata for reliability.
In a centralized file system these techniques are implemented and managed at a single controller or node. The use of a distributed or shared file system involving multiple storage nodes that can be accessed by multiple requesters, however, further complicates the challenges of providing accurate high performance data storage that can be recovered or reconstructed in the event of an unscheduled stop. One such parallel file system that uses a common meta-data node for storing meta-data is described in U.S. Pat. Nos. 5,960,446 and 6,032,216. A distributed file system that uses a decentralized meta-data system is described in U.S. Pat. No. 6,029,168.
One of the problems encountered in a distributed or shared file system is the problem of ensuring that only one requestor accesses a file at a time, which is sometimes referred to as coherency. U.S. Pat. No. 6,058,400 describes a cluster coherent distributed file system that coordinates the storage of meta-data by providing a cluster file system which overlays a plurality of local file systems. This patent solves the problems of coherency by selectively flushing meta-data for a file stored in the controller of one of the storage nodes to disk. U.S. Pat. No. 5,999,930 describes the use of a common control process and control table to allow clients to obtain exclusive write access control over a data storage volume to solve the problem of coherency. While these techniques address the problem of coherency on a volume or file level, they do not provide a solution to maintaining coherency between different controllers or storage nodes that each have blocks of data and/or parity stored as part of a parity group under control of a distributed file system.
In the context of ensuring coherency between parity and data updates in a parity group stored under control of a distributed file system having centralized metadata, the logging and mirroring techniques previously described are used to protect the centralized metadata. When a logging technique is used, a single log of all transactions on the parity disk is maintained as an additional tool to aid in the proper reconstruction of data and parity in the event of an unscheduled stop. When used in RAID 5, there is one journal entry for an entire parity group that keeps track of whether the parity group has a parity update in progress. If a parity update is in progress and an unscheduled stop occurs, then the log is examined in an attempt to reconstruct the parity block. The problem when there are multiple controllers responsible for writing information to the disks is that there is no way to know whether the recovery will be correct, because there is no way of knowing whether the changes that are logged for the parity disk were actually made and completed for the data disks. Depending upon the nature of the unscheduled stop and whether the data changes were actually made, these techniques may or may not be sufficient to appropriately reconstruct data or parity in a parity group.
Although techniques have been developed to maintain consistency and enable efficient reconstruction of data in the event of an unscheduled stop for hardware and software array storage systems having a single centralized controller, it would be desirable to provide a solution to these problems for a distributed file system in the context of a software-implemented array storage system having multiple controllers.
The present invention provides a controller fault recovery system for recovering from faults that cause unscheduled stops for a distributed file system operating on an array storage system having multiple controllers. A proxy arrangement protects data integrity in the event of an unscheduled stop on just one controller in the array storage system. An atomic data/parity update arrangement protects data integrity in the event of an unscheduled stop of more than one controller.
In the present invention, an array storage system stores data objects (e.g., files) that are arranged with at least one parity group having a number N of data blocks and a parity block computed from the N data blocks. The array storage system includes an array of storage devices and at least N+1 controllers, where each controller operably controls a unique portion of the array of storage devices. Preferably, a controller is a software-implemented controller comprised of a programmed process running on a computing device. A distributed file system is used to manage the array storage system and has at least one input/output manager (IOM) routine for each controller. Each IOM routine includes a software routine for controlling access to the unique portion of the array of storage devices associated with that controller. Software for the controller fault recovery system that is part of each IOM routine maintains a journal reflecting a state of all requests and commands received and issued for that IOM routine.
In response to a notification that at least one of the IOM routines has experienced an unscheduled stop, the controller fault recovery system in the IOM routine reviews the journal and the state of all requests and commands received and issued for that IOM routine and publishes any unfinished requests and commands for the failed IOM routine(s). Publication may be accomplished by sending a message, setting a flag or otherwise communicating to any or all of the IOMs, including the IOM making the publication. The notification may be an external notification that a single IOM has failed, in which case the distributed file system keeps running and the fault recovery system uses a proxy arrangement to recover; or, the notification may be that all of the IOMs experienced an unscheduled stop because more than one IOM has failed, in which case the distributed file system is restarted and the notification occurs internally as part of the restart procedure for each IOM.
Under the proxy arrangement, the distributed file system assigns a proxy IOM routine for each IOM routine, the proxy IOM routine being an IOM routine different than the assigned IOM routine. The proxy IOM routine includes software for monitoring the assigned IOM routine for a failure and, in the event of a failure of only the assigned IOM routine, issuing a notification to all other IOM routines. The proxy IOM routine then receives from all other IOM routines the publication of any unfinished requests or commands for the assigned IOM routine that has failed. Software in the proxy IOM routine marks in a meta-data block for the assigned IOM routine a state of any data blocks, parity blocks or meta-data blocks associated with the unfinished requests or commands reflecting actions needed for each such block when the assigned IOM routine recovers.
Under the atomic data/parity update arrangement, the distributed file system performs an unscheduled stop of all IOM routines in the event of a failure of more than one of the IOM routines. Upon recovery of at least N of the IOM routines after the unscheduled stop, each IOM routine reviews the journal and state for that IOM routine and the publication of any unfinished requests or commands for that IOM routine from all of the other IOM routines and reconstructs each data block, parity block or metadata block in response, so as to ensure that any updates to new blocks of data or new blocks of parity are atomic.
The data/parity update process is atomic because updates to the data block and an associated parity block in a parity group happen together or not at all. When a new data block and parity block are stored in the array storage system by the distributed file system, the new data block and new parity block are each stored by a different storage node. The current addresses of the data block and parity block for a parity group are not changed in the metadata stored by each storage node in the parity group, and the old data block and old parity block are not discarded, until it has been confirmed that both the new data block and the new parity block are safely written to disk by the storage nodes for both the data block and the parity block. If an unscheduled stop occurs in the middle of a data/parity update process, journals of all of the transactions between the input/output managers maintained by each storage node involved in the data/parity update process are evaluated to determine the extent to which the data/parity update was completed. If the new data and/or parity is confirmed as accurate and available, either on disk or in the buffers of one of the input/output managers, then the recovery process will complete the data/parity update and update the current address of the data block and parity block. If neither the new data nor the new parity is confirmed as accurate and available, then the recovery process will keep the current address for the data block and parity block at the addresses of the old data block and old parity block.
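The recovery decision described above can be sketched as follows. This is a simplified illustration, not the claimed implementation: the `UpdateState` record, its field names, and the use of integer addresses are all assumptions, and the step of regenerating a missing new block from the confirmed one is only noted in a comment.

```python
from dataclasses import dataclass

@dataclass
class UpdateState:
    """Hypothetical summary of one data/parity update, derived from the journals."""
    old_data_addr: int
    old_parity_addr: int
    new_data_addr: int
    new_parity_addr: int
    new_data_confirmed: bool = False    # new data block accurate and available
    new_parity_confirmed: bool = False  # new parity block accurate and available

def recover(state):
    """Return the (data, parity) addresses that should be current after recovery."""
    if state.new_data_confirmed or state.new_parity_confirmed:
        # Complete the update. If only one new block is confirmed, the other
        # is assumed to be regenerated before the addresses are switched
        # (regeneration is not shown in this sketch).
        return state.new_data_addr, state.new_parity_addr
    # Neither new block is confirmed: keep pointing at the old blocks.
    return state.old_data_addr, state.old_parity_addr

rolled_back = recover(UpdateState(10, 11, 20, 21))
completed = recover(UpdateState(10, 11, 20, 21, new_data_confirmed=True))
```

In both outcomes the data address and the parity address move together, so a reader of the metadata never sees a new data block paired with an old parity block or vice versa.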
In a preferred embodiment, the distributed file system utilizes a distributed metadata arrangement in which each storage node maintains its portion of the meta-data and the distributed file system handles requests on a file-oriented basis, instead of a block-oriented basis. The file-oriented distributed file system relies on an object-oriented RAID approach where a memory translator for the file system translates requests from clients and keeps track of where blocks are logically located within a file. An input/output manager for each storage node maps the logical blocks to their corresponding physical location in that node's storage system. Preferably, each storage node also keeps its own journal of transactions between that node and all other nodes relating to data/parity updates, thereby providing for an inherent redundancy in the journaling of data/parity updates that can recover from the failure of any one of the storage nodes, including one of the two storage nodes involved in a data/parity update process.
In one embodiment, a computer-implemented method of storing data objects in an array storage system provides for atomic data/parity updates of data objects. The data objects include at least one parity group having a number N of data blocks and a parity block computed from the N data blocks. The data objects are stored in the array storage system under software control of a distributed file system having at least a number N+1 of input/output manager (IOM) routines. Each IOM routine controls access to a unique portion of the storage system. When a write request to store a new block of data for a data object is received by a first IOM, the first IOM issues an update parity request to a second IOM associated with a parity block for the new block of data. The first IOM then issues a write command to write the new block of data to the array storage system and waits to receive a write command complete from the array storage system for the new block of data. The second IOM receives the update parity request from the first IOM and computes a new block of parity for the parity group that includes the new block of data. The second IOM then issues a write command to write the new block of parity and waits to receive a write command complete from the array storage system for the new block of parity. Each of the first and second IOMs maintains a journal of all requests and commands received and issued among the IOMs. In the event of an unscheduled stop of the array storage system, the data parity group of the data object is recovered by reviewing the journal entries for both the first and second IOM and reconstructing the data block or the parity block in response if necessary.
Preferably, each IOM makes a journal entry that corresponds to each of the following: a write data command from a requester, an update parity request from another IOM, a write data command complete from the array storage system, a write parity command issued in response to the update parity request from another IOM, a write parity command complete from the array storage system, and an update parity request complete from another IOM.
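The journaling of a single write can be sketched in Python as follows. The entry names mirror the six events listed above, but the class names, the transaction identifiers, the strictly sequential ordering of the steps, and the `stop_after` parameter used to simulate an unscheduled stop are all assumptions made for illustration; a real system would interleave these events concurrently.

```python
from enum import Enum, auto

class Entry(Enum):
    WRITE_DATA_REQUEST = auto()      # write data command from a requester
    UPDATE_PARITY_REQUEST = auto()   # update parity request from another IOM
    WRITE_DATA_COMPLETE = auto()     # write data command complete from storage
    WRITE_PARITY_ISSUED = auto()     # write parity command issued
    WRITE_PARITY_COMPLETE = auto()   # write parity command complete from storage
    UPDATE_PARITY_COMPLETE = auto()  # update parity request complete from another IOM

class IOM:
    """Hypothetical input/output manager holding only a journal."""
    def __init__(self, name):
        self.name = name
        self.journal = []  # list of (txn_id, Entry)

    def log(self, txn_id, entry):
        self.journal.append((txn_id, entry))

def write_block(data_iom, parity_iom, txn_id, stop_after=None):
    """Journal one data/parity update; optionally stop after a given step."""
    steps = [
        (data_iom, Entry.WRITE_DATA_REQUEST),
        (parity_iom, Entry.UPDATE_PARITY_REQUEST),
        (data_iom, Entry.WRITE_DATA_COMPLETE),
        (parity_iom, Entry.WRITE_PARITY_ISSUED),
        (parity_iom, Entry.WRITE_PARITY_COMPLETE),
        (data_iom, Entry.UPDATE_PARITY_COMPLETE),
    ]
    for count, (iom, entry) in enumerate(steps, start=1):
        iom.log(txn_id, entry)
        if stop_after == count:
            return  # simulated unscheduled stop

def unfinished(data_iom, parity_iom):
    """Transactions seen in either journal that did not complete on both IOMs."""
    seen = {t for t, _ in data_iom.journal + parity_iom.journal}
    data_done = {t for t, e in data_iom.journal if e is Entry.UPDATE_PARITY_COMPLETE}
    parity_done = {t for t, e in parity_iom.journal if e is Entry.WRITE_PARITY_COMPLETE}
    return sorted(seen - (data_done & parity_done))

first, second = IOM("first"), IOM("second")
write_block(first, second, txn_id=1)                # completes normally
write_block(first, second, txn_id=2, stop_after=3)  # stops after the data write completes
pending = unfinished(first, second)
```

After the simulated stop, `pending` contains only transaction 2, which is the update the recovery procedure would need to examine using both journals.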
In another embodiment, the data parity group is provided with a proxy capability that allows the data parity group to continue operating in a reduced capacity in the event of a failure of one of the storage nodes in the data parity group. In this embodiment, read and write requests directed to the failed storage node are handled by another storage node that has been designated as the proxy or secondary for the failed storage node. The proxy storage node recovers the requested data by regenerating the data using the data from all of the other storage nodes in the data parity group together with the parity data. If the request is a read request, the proxy storage node simply regenerates the requested data from the data and parity on other nodes and returns the requested data. If the request is a write request, the proxy storage node uses the regenerated data to determine the update to the parity block corresponding to the new data block that is to be written. Although the new data block cannot actually be written because the failed storage node on which it should be written is unavailable, the proxy storage node directs the parity storage node to update and write the parity block. The proxy storage node also modifies the metadata for the failed storage node by modifying the parity block for the metadata for the failed storage node in a similar manner. When the failed storage node is once again available, the IOM routine reconstructs the metadata using the parity blocks associated with the metadata. The IOM routine then evaluates this metadata to determine which of the data blocks on that storage node need to be reconstructed because they were modified during the period of time when the failed storage node was unavailable. Each data block that is to be reconstructed is now rebuilt by the IOM routine for the now recovered storage node. 
The IOM routine regenerates the data block from the data blocks on the other storage nodes and the parity block that was modified by the proxy storage node to reflect the changes made to that data block.
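The proxy read, proxy write, and later rebuild described above can be sketched in Python. The node numbering, block contents, and function names are illustrative assumptions; the metadata handling is omitted, and the failed node's stored block is never read by the proxy functions.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def proxy_read(nodes, parity, failed):
    """Regenerate the failed node's block from the surviving blocks and parity."""
    survivors = [blk for n, blk in nodes.items() if n != failed]
    return xor_blocks(parity, *survivors)

def proxy_write(nodes, parity, failed, new_block):
    """Return the parity that reflects new_block without writing the failed node."""
    old_block = proxy_read(nodes, parity, failed)
    return xor_blocks(parity, old_block, new_block)

# Parity group of three data nodes; node 1 has failed.
nodes = {0: b"AAAA", 1: b"BBBB", 2: b"CCCC"}
parity = xor_blocks(*nodes.values())
failed = 1

read_back = proxy_read(nodes, parity, failed)          # serves a read request
parity = proxy_write(nodes, parity, failed, b"ZZZZ")   # serves a write request

# When node 1 returns, its block is rebuilt from the other nodes and the
# parity block that the proxy updated.
rebuilt = proxy_read(nodes, parity, failed)
```

The rebuilt block equals the data written through the proxy, even though that data never reached the failed node while it was unavailable.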
In this embodiment, each of the storage node IOMs and the parity storage node IOM (other than a proxy IOM) maintains a journal of all requests and commands received and issued among each other. In the event of an unscheduled stop of the array storage system, the data parity group of the data object is recovered by reviewing the journal entries for all of the IOMs and reconstructing the data/parity block in response if necessary. When a storage node IOM fails, each of the other IOMs uses its journal processing to identify any data/parity updates that were in process between that IOM and the failed IOM. All of the other IOMs communicate to the proxy IOM for the failed IOM which of the data/parity blocks were in the process of being updated. That proxy IOM then marks the metadata for those data/parity blocks in the failed IOM that will need to be reconstructed once the failed IOM is restarted or replaced.
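A minimal Python sketch of this publication and marking step follows. The journal record layout (`peer`, `block`, `done`), the IOM names, and the use of a plain set to stand in for the marked metadata are hypothetical simplifications.

```python
def publish_unfinished(journal, failed_iom):
    """Blocks with an update still in process between this IOM and the failed IOM."""
    return {e["block"] for e in journal if e["peer"] == failed_iom and not e["done"]}

# Journals kept by the surviving IOMs (illustrative contents).
journals = {
    "iom_a": [
        {"peer": "iom_c", "block": "c:17", "done": True},
        {"peer": "iom_c", "block": "c:42", "done": False},
    ],
    "iom_b": [
        {"peer": "iom_c", "block": "c:99", "done": False},
        {"peer": "iom_a", "block": "a:03", "done": False},
    ],
}

# The proxy IOM collects every publication and marks the affected blocks
# as needing reconstruction once the failed IOM is restarted or replaced.
needs_rebuild = set()
for name, journal in journals.items():
    needs_rebuild |= publish_unfinished(journal, "iom_c")
```

Only updates that involve the failed IOM and were still in process are marked; completed updates and updates between surviving IOMs are left alone.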