This invention relates to data storage in a computerized storage area network (SAN) or system utilizing multiple controllers. More particularly, the present invention relates to a new and improved technique of determining whether one of the controllers or a device connected to the controller is functioning properly. Rather than merely detecting a lack of response to a data access request and inferring that something is not working, a test of certain capabilities of the controller is initiated so that particular problems can be diagnosed.
In a computerized storage area network (SAN), various storage devices, such as hard drives, compact disc (CD) drives, tape drives and the like, are used to store data. The storage devices are typically arranged in groups, such as a RAID (Redundant Array of Independent Drives) configuration. One or more redundant disk array controllers (a.k.a. RDAC) are connected to each group of storage devices to control access to the storage devices. The groups are sometimes contained in storage units, such as storage arrays, so the controllers handle data accesses between the individual storage devices within the storage array and other components of the SAN outside of the storage array.
The storage area network (SAN) also typically includes a plurality of host devices connected through a switched, or network, fabric to the storage arrays. The host devices access a plurality of logical data volumes present on the storage devices in the storage arrays, usually on behalf of a plurality of client devices which are typically connected to each host device. Each storage array is connected at the controllers to one or more host devices through the network fabric.
Each host device can typically transfer data with each storage array and the logical data volumes stored therein through more than one data path. Each data path extends through the switched fabric to one of the controllers in the storage array. Since the storage array typically contains two (and possibly more) of the controllers, the host device typically has two (and possibly more) data paths to each storage array. The controllers are “redundant” because typically either one can satisfy data access requests from any host device to any storage device or logical data volume on the storage array.
The redundancy ensures that the logical data volumes will be available to the host devices in the event that one of the data paths develops a problem or fails to operate. If a host device detects a failure in one of the data paths to a storage array, the host device switches to the other data path to access the storage array.
The host device typically detects the failure when the host device sends a data access request through the data path, but either a response is not returned within a predetermined time period or the response includes an error notification. The problem that caused the error or failure may have occurred in the data path (e.g. in the switched fabric, a networking device, a cable or other component of the data path) or in the host device (e.g. in a network interface card or host bus adapter through which the host device accesses the switched fabric) or in the storage array (e.g. in the array controller, the storage device or other component of the storage array). However, no determination is made by the host device regarding the cause of the failure. Instead, a notification is sent to a system administrator indicating the data path that is not responding. It is typically then left to the system administrator to perform the burdensome task of diagnosing or troubleshooting the problem that caused the failure.
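The conventional host-side behavior described above can be illustrated with a minimal Python sketch. It is not taken from any actual host driver: the names `DataPath`, `Response`, `PathTimeout`, `access_with_failover` and the `notify` callback are all hypothetical, and the transport is reduced to a callable. The sketch shows only that the host detects a failure by a missing response or an error response, reports which path failed, and switches to the other path without determining the cause.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class PathTimeout(Exception):
    """Hypothetical signal that no response arrived within the time limit."""


@dataclass
class Response:
    ok: bool              # False models a response carrying an error notification
    payload: bytes = b""


@dataclass
class DataPath:
    name: str
    send: Callable[[bytes], Response]  # hypothetical transport hook


def access_with_failover(paths: List[DataPath], request: bytes,
                         notify: Callable[[str], None]) -> Optional[Response]:
    """Try each data path in order; report failed paths, but not their cause."""
    for path in paths:
        try:
            response = path.send(request)
        except PathTimeout:
            notify(f"data path {path.name} is not responding")
            continue
        if response.ok:
            return response
        notify(f"data path {path.name} returned an error")
    return None  # every path failed


def _dead(request: bytes) -> Response:
    raise PathTimeout()


def _alive(request: bytes) -> Response:
    return Response(ok=True, payload=request)


messages: List[str] = []
result = access_with_failover(
    [DataPath("A", _dead), DataPath("B", _alive)], b"read volume 0", messages.append
)
```

In this sketch the notification names only the failed path, which mirrors the limitation noted above: the administrator learns that path A failed but not whether the fault lies in the host, the fabric or the storage array.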
It is with respect to these and other background considerations that the present invention has evolved.
The present invention relieves some of the burden from the system administrator for troubleshooting the problem that caused a failure in a data path by automatically initiating a test of one or more of the array controllers in the storage array and disabling certain non-functional equipment when a problem is detected. The present invention also monitors the functional condition or status of the storage array by periodically initiating the test of the array controller(s), so the status of the storage array can be determined even before the host device has detected a failure or error.
One of the array controllers initiates the test of the other array controller, so if the controller under test is not functioning properly, the controller initiating the test can provide explanatory results of the test to the host device or the system administrator. The test checks the operation of parts of the array controller, the storage devices and the network fabric, so if the problem exists in one of these components of the storage area network, the explanatory results can provide the location of the problem for the system administrator, who can then quickly correct the problem. Even if the test does not identify a problem in any of the checked components, when the host device, nevertheless, has detected a failure, then the test will have eliminated the checked components as the source of the problem, so the system administrator can focus any troubleshooting efforts elsewhere.
These and other improvements are achieved by testing the operational condition of one of the controllers in a computerized system that has at least two controllers and one or more storage devices. The controllers are for controlling access to computerized data stored on the storage devices. The second controller sends a test command to the first controller to cause the first controller to execute predetermined operating functions. In response, the first controller attempts to perform the predetermined operating functions, preferably by directing certain data access commands to the storage devices. The outcome of the attempted predetermined operating functions is analyzed to determine whether the first controller was successful in performing the predetermined operating functions. The operational condition of the first controller is then determined based on whether the first controller was successful in performing the predetermined operating functions.
The controller under test preferably performs a read operation and/or a write operation on one or more of the storage devices to test its ability to access the storage devices. For the read operation, the controller initiating the test preferably writes some test data to the storage devices and then passes some test information to the controller under test with which the controller under test can check the test data after reading the test data from the storage devices. For the write operation, the controller under test preferably generates additional test data from the same test information and writes the additional test data to the storage devices, so the controller initiating the test can read the additional test data and check it with the original test information. Additionally, to perform either or both of the read and write operations, the controller under test preferably issues read and/or write commands to itself, to which the controller under test responds in a normal fashion as if the read and/or write commands were generated externally. Furthermore, the computerized system is preferably part of a networked storage system, and the controller under test preferably sends the read and/or write commands to an external device, such as a network device, that returns, or “loops back,” the commands to the controller under test.
The previously mentioned and other improvements are also achieved in a storage array for servicing data access requests received from the host devices through the network. The storage array includes an array of storage devices, two array controllers and a memory device (e.g. random access memory, or RAM). The array controllers are connected to each other, the network, the array of storage devices and the memory device. The memory device contains firmware instructions that cause the array controllers to perform a test of the operational condition of one of the array controllers, in which the second array controller initiates the test of the first array controller to determine whether the first array controller is operating. The first array controller attempts to perform predetermined operating functions, preferably reading data from and writing data to the array of storage devices. The outcome of the predetermined operating functions is analyzed to determine whether the first array controller was successful in performing the predetermined operating functions, which indicates the operational condition of the first array controller.
Under the read data function, the firmware instructions preferably cause the second array controller to generate test data and write it to the array of storage devices and the first array controller to read the test data from the array of storage devices and detect whether the test data is correct. Additionally, the second array controller preferably generates test information, which it uses to generate the test data and which the first array controller uses to detect whether the test data is correct.
Under the write data function, the firmware instructions preferably cause the first array controller to generate the test data and write it to the array of storage devices and the second array controller to read the test data from the array of storage devices and detect whether the test data is correct. Additionally, the first array controller preferably generates the test data from the test information that the second array controller sent to the first array controller. The second array controller uses the test information to determine whether the test data is correct. To perform the write data function under the firmware instructions, the first array controller preferably issues a write command to itself by sending the write command to the network with instructions to return the write command to the first array controller, so the first array controller can respond to receiving the write command by performing the write data function.
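The loop-back step of the write data function can be modeled with a short Python sketch. It rests on several assumptions: the network is reduced to a single hypothetical `LoopbackDevice` that returns a command to its sender, the storage array is a dictionary, and `WriteCommand`, `ArrayController`, `on_command` and `self_write_via_loopback` are illustrative names that do not come from the firmware. The point shown is that the first array controller handles the returned command through its normal receive path, so the test also exercises its network interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class WriteCommand:
    source: str
    target: str
    block: int
    data: bytes


class LoopbackDevice:
    """Hypothetical network device that returns a command to its sender."""

    def __init__(self, up: bool = True):
        self.up = up  # False models a fault in the network fabric

    def forward(self, command: WriteCommand,
                deliver: Callable[[WriteCommand], None]) -> bool:
        if not self.up:
            return False
        deliver(command)
        return True


class ArrayController:
    def __init__(self, name: str, storage: Dict[int, bytes]):
        self.name = name
        self.storage = storage  # stands in for the array of storage devices

    def on_command(self, command: WriteCommand) -> None:
        """Normal receive path: handle the command as if it arrived externally."""
        if command.target == self.name:
            self.storage[command.block] = command.data

    def self_write_via_loopback(self, device: LoopbackDevice,
                                block: int, data: bytes) -> bool:
        """Address a write command to itself and send it out to be looped back."""
        command = WriteCommand(self.name, self.name, block, data)
        returned = device.forward(command, self.on_command)
        return returned and self.storage.get(block) == data


array: Dict[int, bytes] = {}
first = ArrayController("first", array)
ok = first.self_write_via_loopback(LoopbackDevice(), 7, b"pattern")
failed = first.self_write_via_loopback(LoopbackDevice(up=False), 8, b"pattern")
```

A failure of the looped-back write alongside a successful direct write would, in this model, point toward the network side rather than the storage devices.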
A more complete appreciation of the present invention and its scope, and the manner in which it achieves the above noted improvements, can be obtained by reference to the following detailed description of presently preferred embodiments of the invention taken in connection with the accompanying drawings, which are briefly summarized below, and the appended claims.