The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for protecting storage fabrics from an errant device causing a single point of failure.
A storage area network (SAN) is a network of storage disks. In large enterprises, a SAN connects multiple servers, or hosts, to a centralized pool of disk storage. Compared to managing hundreds of servers, each with its own disks, managing a SAN simplifies system administration. Treating all of the company's storage as a single resource makes disk maintenance and routine backups easier to schedule and control. In some SANs, the disks themselves can copy data to other disks for backup without any processing overhead at the host computers.
A storage system may be exposed to a single point of failure (SPOF) when an errant device floods the sub-system's communication channels with traffic that is unsolicited and out of protocol. That is, a device may send traffic that the initiating device is not expecting. For example, in a Serial Attached SCSI (SAS) fabric, an initiator may send an IO to a device, and the IO then completes normally. The device may subsequently enter an error condition and, as a result, issue repeated SAS Response frames for the already completed IO. The controller may then experience a variety of problems.
It is extremely likely that a controller will simply ignore such a frame, which on the surface seems an adequate response. However, a device that issues many repeated frames causes the controller to spend significant operational resources to detect each “out of context” frame, disregard the frame, and ensure that the frame has no knock-on consequences, such as a device not doing its normal job or indicating that a job is done when the job is not done.
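The cost of ignoring repeated frames can be illustrated with a minimal sketch. The `Controller` class, its counters, and the `simulate` helper below are hypothetical names introduced only for illustration; they do not model any real SAS stack. The sketch assumes that each response frame carries an identifier for its IO and that the controller tracks which IOs are still outstanding.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller model (an assumption, not a real SAS stack)."""
    outstanding: set = field(default_factory=set)  # IDs of IOs awaiting a response
    completed: int = 0        # frames that completed an outstanding IO
    out_of_context: int = 0   # frames that matched no outstanding IO

    def send_io(self, io_id: int) -> None:
        # Record that a response is now expected for this IO.
        self.outstanding.add(io_id)

    def on_response(self, io_id: int) -> bool:
        """Return True if the frame completed an IO, False if it was discarded."""
        if io_id in self.outstanding:
            self.outstanding.remove(io_id)
            self.completed += 1
            return True
        # No IO is waiting on this frame, so it is out of context.
        # The controller has still paid the cost of receiving and checking it.
        self.out_of_context += 1
        return False


def simulate(repeats: int) -> Controller:
    """One normal IO, followed by `repeats` duplicate responses from the device."""
    ctrl = Controller()
    ctrl.send_io(io_id=7)
    ctrl.on_response(io_id=7)       # normal completion
    for _ in range(repeats):        # errant device repeats the response
        ctrl.on_response(io_id=7)
    return ctrl
```

In this sketch, every duplicate frame is discarded correctly, yet each one still passes through `on_response`. The work done on discarded frames grows with the number of repeats while the count of useful completions stays at one.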
For example, current products have known issues in this regard because of hardware assist. At some point, the controller may wish to re-use the protocol-supplied TAG that matches the one from the errant device. A TAG is a unique identifier (ID) for an operation. This identifier is part of the protocol. There are a limited number of TAGs, so they must be re-used. The controller is at liberty to re-use the TAG because the operation bearing that TAG has already received a successful response. Receiving the errant response frame after re-using the TAG causes further issues for the controlling device, whose hardware may end up in an invalid state. This could ultimately lead to a system catastrophe.
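The TAG re-use hazard can be sketched as follows. This is an illustrative model under stated assumptions, not the behavior of any particular product: `TagPool`, `demo_stale_response`, and the operation names are hypothetical, and the pool is limited to a single TAG so that re-use is forced immediately.

```python
class TagPool:
    """Hypothetical fixed-size pool of protocol TAGs (an assumption for illustration)."""

    def __init__(self, size: int) -> None:
        self.free = list(range(size))

    def allocate(self) -> int:
        # Hand out the oldest free TAG.
        return self.free.pop(0)

    def release(self, tag: int) -> None:
        # A TAG becomes re-usable once its operation has been responded to.
        self.free.append(tag)


def demo_stale_response():
    pool = TagPool(size=1)
    inflight = {}   # TAG -> name of the operation currently using it
    done = []       # operations the controller believes are complete

    # First operation completes normally and its TAG is released.
    first_tag = pool.allocate()
    inflight[first_tag] = "read A"
    done.append(inflight.pop(first_tag))
    pool.release(first_tag)

    # The controller re-uses the same TAG for a new operation.
    second_tag = pool.allocate()
    inflight[second_tag] = "write B"

    # The errant device repeats its response for "read A".
    # Matching on TAG alone, the controller cannot tell it is stale.
    stale_tag = first_tag
    if stale_tag in inflight:
        done.append(inflight.pop(stale_tag))  # "write B" wrongly marked done

    return first_tag == second_tag, done
```

Here the stale frame is indistinguishable, by TAG alone, from a genuine response to the new operation, so the sketch reports "write B" as done although the device never responded to it.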
In such a situation, the controller is unable to dedicate enough resources to progress operations in a timely fashion and has no method of isolating the device. With dual-ported devices, an errant device could potentially cause failures on both SAS fabrics.
Thus, an errant device issuing out-of-context, or even malicious, traffic may cause serious problems for a controlling device. Such a situation may lead to IO errors and a potential Denial of Service at both the SAS level and the controller level, because the controller spends resources dealing with the SAS bombardment and those resources do not contribute to normal service.