1. Field of the Invention
An embodiment of the present invention relates to a disk array apparatus, a disk array control method, and a disk array controller, and more particularly to a disk array apparatus, a disk array control method, and a disk array controller in which an abnormal path can be correctly degenerated when a path abnormality occurs that looks like a disk abnormality.
2. Description of the Related Art
In a storage unit, a technique has been proposed that counts the number of detected failures in every separable region, and determines and separates the region in which trouble has occurred by statistical analysis of the count results (refer to Japanese Patent Laid-Open No. 11-296311).
Also, in a disk array controller, a technique has been proposed in which a value of a management table is increased when a failure occurs, and, when the value exceeds a threshold, the concerned interface is separated (refer to Japanese Patent Laid-Open No. 10-275060).
Also, in a disk array apparatus, a technique has been proposed in which, when a failure occurs, points corresponding to the failure are deducted from the points of the component, and, when the points fall below a point reference value, the concerned component is degenerated (refer to Japanese Patent Laid-Open No. 2004-252692).
We have examined degenerating not only the disks D00 to D2B but also a plurality of paths P1 to P7 by statistical score addition processing in a disk array apparatus, as illustrated in FIG. 7. However, we have found that the following problem exists when both the disks and the paths are degenerated by the statistical score addition processing.
That is, in such a disk array apparatus, it is assumed that an abnormality occurs on a path P1 (or P2) between a control unit RoC#0 and a switch unit BE Exp (SAS switch) contained in a controller module CM#0. In this case, whichever of the disks D00 to D2B is accessed, the control unit RoC#0 considers that an SAS error time-out has occurred. The SAS error time-out is usually the error returned in the case of a disk abnormality. Therefore, the control unit RoC#0 adds statistical scores, judging that the accessed disk is abnormal, and degenerates the concerned disk when the number of scores exceeds a threshold (corresponding to the case in which the SAS error time-out occurs four times). The degenerated disk cannot be reused unless maintenance is performed.
Referring to FIGS. 8 to 10, the processing for degenerating an abnormal path or an abnormal disk in the disk array apparatus of FIG. 7 will be described below.
Now, it is assumed that the path P1 between the control unit RoC#0 and the switch unit BE Exp#0 is abnormal. Due to this abnormality, the control unit RoC#0 considers that the same SAS error time-out occurred in a plurality of disks D00 to D2B. For example, for the sake of simpler explanation, it is assumed that the control unit RoC#0 considers that the errors occurred in the disks D19, D05, D20, D19, D05, D20, D19, D05, D20 and D19 in this order. The number 19 in “D19” designates the ID of the disk (same for others).
The number of scores for every path and disk in the statistical score addition table 25 is set to “0” in an initial state of the disk array apparatus, as illustrated in FIG. 8A. When the number of scores for a path or disk exceeds “255” in the statistical score addition table 25, the concerned path or disk is separated (or degenerated).
First, in the access from the control unit RoC#0 to the disk D19, the control unit RoC#0 detects the SAS error time-out. Accordingly, in the statistical score addition table 25, the control unit RoC#0 adds “10” to each of the paths P1 and P4, which are the paths from the control unit RoC#0 to the concerned disk D19, and adds “80” to the concerned disk D19. Since the SAS error time-out is the error returned in the case of a disk abnormality as described above, a sufficiently higher number of scores is added to the concerned disk than to the concerned paths. More specifically, the number of scores for a disk is several times (eight times in this case) the number of scores for a path. As a result, the statistical score addition table 25 is updated as illustrated in FIG. 8B. In the figures, the added parts are underlined (hereinafter the same for FIGS. 8C to 10).
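For illustration only, the score addition for a single SAS error time-out may be sketched as follows. The function name add_timeout_scores and the dictionary standing in for the statistical score addition table 25 are assumptions made for this sketch; only the values “10” and “80” and the route P1, P4 to the disk D19 are taken from the description above.

```python
PATH_SCORE = 10   # added to each path on the route to the accessed disk
DISK_SCORE = 80   # added to the accessed disk (eight times the path score)

def add_timeout_scores(table, route, disk):
    """Record one SAS error time-out in the score table (a plain dict)."""
    for path in route:
        table[path] = table.get(path, 0) + PATH_SCORE
    table[disk] = table.get(disk, 0) + DISK_SCORE
    return table

# Initial state: every score is 0, as in FIG. 8A (only the entries used here).
table = {"P1": 0, "P4": 0, "D19": 0}
add_timeout_scores(table, ["P1", "P4"], "D19")
print(table)  # {'P1': 10, 'P4': 10, 'D19': 80}
```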
Next, the control unit RoC#0 checks the updated statistical score addition table 25, and investigates whether or not there is any path or disk to be separated, namely, having the number of scores greater than or equal to “255”. In this case, since there is no path or disk to be separated, the control unit RoC#0 performs subsequent access to the disk as usual.
Next, access to the plurality of disks is repeated in the same way as above, so that “10” is added to each concerned path and “80” is added to the concerned disk on every access. As a result, the statistical score addition table 25 is updated successively, and it is investigated every time whether or not there is any path or disk having the number of scores greater than or equal to “255”, as illustrated in FIGS. 8C to 10C.
That is, the statistical score addition table 25 is updated successively by access to the disk D05 as illustrated in FIG. 8C, by access to the disk D20 as illustrated in FIG. 8D, by access to the disk D19 as illustrated in FIG. 9A, by access to the disk D05 as illustrated in FIG. 9B, by access to the disk D20 as illustrated in FIG. 9C, by access to the disk D19 as illustrated in FIG. 9D, by access to the disk D05 as illustrated in FIG. 10A, by access to the disk D20 as illustrated in FIG. 10B, and by access to the disk D19 as illustrated in FIG. 10C.
When the control unit RoC#0 checks the statistical score addition table 25 of FIG. 10C, the number of scores of the disk D19 exceeds “255”. Thus, the control unit RoC#0 separates (degenerates) the disk D19 from the concerned disk array.
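The ten accesses described above may be replayed with the following minimal sketch, which is an illustration and not the processing of the control unit RoC#0 itself. The names ROUTES and simulate are hypothetical. Only the route of the disk D19 (P1 and P4) is stated in the description, so for the disks D05 and D20 only the shared path P1 is tracked. The check uses "greater than or equal to 255"; since every score here is a multiple of 10, a strict "exceeds 255" check gives the same result.

```python
PATH_SCORE, DISK_SCORE, THRESHOLD = 10, 80, 255

# Routes from RoC#0. The downstream paths of D05 and D20 are not given
# in the text, so only the shared path P1 is tracked for them (assumption).
ROUTES = {"D19": ["P1", "P4"], "D05": ["P1"], "D20": ["P1"]}

def simulate(accesses):
    """Apply one SAS error time-out per access; return scores and separated parts."""
    table, separated = {}, []
    for disk in accesses:
        for path in ROUTES[disk]:
            table[path] = table.get(path, 0) + PATH_SCORE
        table[disk] = table.get(disk, 0) + DISK_SCORE
        # Check the whole table after every update.
        for name, score in table.items():
            if score >= THRESHOLD and name not in separated:
                separated.append(name)
    return table, separated

# D19, D05, D20 repeated three times, then D19 once more (ten accesses).
accesses = ["D19", "D05", "D20"] * 3 + ["D19"]
table, separated = simulate(accesses)
print(table)      # {'P1': 100, 'P4': 40, 'D19': 320, 'D05': 240, 'D20': 240}
print(separated)  # ['D19']
```

Under these assumptions, the disk D19 reaches 320 on its fourth time-out, while the path P1, which is involved in all ten time-outs, has accumulated only 100.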
As will be clear from the above, the path P1 between the control unit RoC#0 and the switch unit BE Exp#0, which is the actual abnormal part, is not separated even after the disks D have been accessed a considerable number of times. On the other hand, the normal disk D19 is separated.
Further, after the disk D19 is separated, the number of scores in the statistical score addition table 25 remains unchanged, so that other normal disks are also separated after the separation of the disk D19. For example, when the disk D05 is accessed after the state of FIG. 10C, the normal disk D05 is separated. The disk D20 is likewise separated. Accordingly, the same error occurs in every disk D as access to the disks D is repeated, so that plural normal disks are degenerated. As a result, the RAID is blocked, the input/output processing for a host computer 1 can no longer be performed, and the job ends abnormally.
Properly, it is desirable that the controller module CM#0 is degenerated, a retry is made via the path from the host computer 1 to the controller module CM#1, and access from the controller module CM#1 to the concerned disk D is made to continue the input/output processing. For this purpose, it is desirable that, when the abnormality is not a disk abnormality, the disk D is not degenerated but the actual abnormal part (or suspected part) is degenerated.
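The cascade described above may be sketched in the same illustrative manner. The starting scores below are derived from the stated increments of “10” and “80” over the ten time-outs, not read from FIG. 10C itself, and the variable names are hypothetical.

```python
PATH_SCORE, DISK_SCORE, THRESHOLD = 10, 80, 255

# Scores after the ten time-outs (derived from the stated increments).
scores = {"P1": 100, "D19": 320, "D05": 240, "D20": 240}
separated = {"D19"}

# The next accesses still fail, because the fault is on path P1.
for disk in ["D05", "D20"]:
    scores["P1"] += PATH_SCORE
    scores[disk] += DISK_SCORE
    if scores[disk] >= THRESHOLD:
        separated.add(disk)

print(sorted(separated))  # ['D05', 'D19', 'D20']
print(scores["P1"])       # 120

# Further time-outs still needed before P1 itself would reach the threshold.
remaining = -(-(THRESHOLD - scores["P1"]) // PATH_SCORE)  # ceiling division
print(remaining)          # 14
```

In this sketch all three normal disks are separated while the path P1 would still need fourteen further time-outs to reach the threshold.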