(1) Field of the Invention
This invention relates to means for detecting the identity of a channel containing a faulty data storage unit in a data-storage system in which multiple data storage units are arrayed in channels and are buffered and synchronized by a hardware mechanism which handshakes simultaneously with all of the channel buffers.
(2) Description of Related Art Including Information Disclosed Under 37 CFR 1.97 and 37 CFR 1.98.
In an architecture where multiple channels of data from independent sources are synchronized, buffered, and merged by hardware, the presence of a channel hanging in its data-transfer phase can freeze the transfer without any explicit indication of which might be the faulty channel.
A variety of failures can occur in a disk storage system. RAID systems generally are able to tolerate many errors by using coding techniques. Some failures cannot be handled, and are classified as follows.
1. Transient failures. Unpredictable behavior of a disk for a short time.
2. Bad sector. A portion of a disk which cannot be read, often for physical reasons.
3. Controller failure. The disk contents are unaffected, but because of controller failure, the disk cannot be read.
4. Disk failure. The entire disk becomes unreadable, generally due to hardware faults such as a disk head crash.
Pankaj Jalote, Fault Tolerance in Distributed Systems, Prentice Hall, Englewood Cliffs, N.J., 1994, pages 100-101.
In a RAID 1-5 system with single-channel fault-tolerance, for example, it is desirable to isolate the faulty channel and continue system operation without it. Other systems using multiple synchronized buffered independent channels, such as an array of distributed sensors from which signals are merged in hardware before processing, also require detection of faulty channels so that the system can function in their absence. Such arrayed distributed sensors would involve merging of signals and computation in hardware, in order to provide redundancy, so that faulty channels may be excluded. There is a need for a simple, efficient method for detecting faulty channels in such systems.
RAID 1-5 systems are characterized by several external data channels which may be buffered and synchronized by a hardware mechanism which handshakes simultaneously with all of the channel buffers. Such a mechanism enforces synchronization. It is especially useful in achieving high performance when data are byte-striped or hardware is used to overcome the write-parity-generation penalty. In the case of a channel hang or failure, it is important to determine which channel is at fault, because the fault may stem from the interface circuitry for the channel, the device connected to the channel (in a RAID system, a storage disk), or a faulty connection within the channel itself.
The usual mechanism for detecting a hang condition is a timeout. In this case, the microprocessor controlling the transfer sets a timer at the beginning of the transfer, and if the timer runs out before the transfer is completed, it is assumed that a hang condition exists somewhere in the system. Unfortunately, synchronization hardware does not distinguish between the individual channel handshaking signals. Consequently, the usual mechanism does not indicate which channel in the array has ceased to handshake (or where in the interface circuitry leading from the individual channels a fault is located).
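The limitation of the timeout mechanism may be illustrated by a minimal sketch, which is not taken from any cited reference. It assumes a simplified model in which the synchronization hardware advances the transfer only on ticks when every channel handshakes, and the controlling processor sees only the merged result. The names `transfer_with_timeout` and `ready_at`, and the tick-based timer, are hypothetical.

```python
from typing import Callable, List


def transfer_with_timeout(
    ready_at: Callable[[int], List[bool]],  # hypothetical: per-channel handshake state at a tick
    words: int,
    timeout_ticks: int,
) -> str:
    """Model of a timed transfer through merged-handshake hardware.

    Returns "done" if `words` words move before the timer expires,
    otherwise "hang". Only the merged (all-channels) signal is observed.
    """
    moved = 0
    for tick in range(timeout_ticks):
        # The hardware ANDs the handshakes together; the individual
        # channel states are not visible past this point.
        if all(ready_at(tick)):
            moved += 1
            if moved == words:
                return "done"
    return "hang"


# Four channels: one healthy array, and two arrays with a different stuck channel.
healthy = lambda tick: [True, True, True, True]
stuck_channel_0 = lambda tick: [False, True, True, True]
stuck_channel_2 = lambda tick: [True, True, False, True]

result_healthy = transfer_with_timeout(healthy, words=8, timeout_ticks=100)
result_stuck_0 = transfer_with_timeout(stuck_channel_0, words=8, timeout_ticks=100)
result_stuck_2 = transfer_with_timeout(stuck_channel_2, words=8, timeout_ticks=100)
```

In this model a hang on channel 0 and a hang on channel 2 produce the same result, so the timeout alone cannot identify the faulty channel.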
U.S. Pat. No. 5,598,578 discloses a process for distinguishing event words which report temporary channel path errors from event words of higher significance requiring a system reset completion. This system is used to avoid the situation in which an event word buffer is overwhelmed by minor errors and fails to store and report the indication of errors of higher significance.
U.S. Pat. No. 5,640,604 discloses a buffer reallocation system which assesses the amount of used and unused buffer areas in the buffer store and assigns buffers to programs which request buffers.
U.S. Pat. No. 5,604,866 discloses a system for controlling the flow of data to a buffer in which the receiver sends a signal for each free space in its buffer. The sender keeps track of the count and when the buffer is full, the sender will send additional data only after receiving signals from the receiver indicating additional free space in the buffer.
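The general technique described in this paragraph is commonly called credit-based flow control. The following is a minimal sketch of that technique under stated assumptions and is not the implementation of the cited patent. The class names `Receiver` and `Sender` and their methods are hypothetical. Each free buffer space is modeled as one credit signaled to the sender.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver with a fixed-capacity buffer."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffer = deque()

    def free_slots(self) -> int:
        # Number of free-space signals the receiver can issue right now.
        return self.capacity - len(self.buffer)

    def accept(self, item) -> None:
        if len(self.buffer) >= self.capacity:
            raise RuntimeError("buffer overrun: sender ignored flow control")
        self.buffer.append(item)

    def consume(self) -> int:
        # Remove the oldest item and return one credit for the freed space.
        self.buffer.popleft()
        return 1


class Sender:
    """Hypothetical sender that transmits only while it holds credits."""

    def __init__(self, receiver: Receiver) -> None:
        self.receiver = receiver
        self.credits = 0

    def on_credit(self, count: int = 1) -> None:
        # Called for each free-space signal received from the receiver.
        self.credits += count

    def try_send(self, item) -> bool:
        if self.credits == 0:
            return False  # buffer is known to be full; hold the data
        self.receiver.accept(item)
        self.credits -= 1
        return True


rx = Receiver(capacity=2)
tx = Sender(rx)
tx.on_credit(rx.free_slots())      # initial free-space signals
sent = [tx.try_send("a"), tx.try_send("b"), tx.try_send("c")]
tx.on_credit(rx.consume())         # receiver frees one space and signals it
sent_after_credit = tx.try_send("c")
```

In this sketch the third send is refused until the receiver consumes an item and returns a credit.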
U.S. Pat. No. 5,619,675 discloses a cache buffer management system in which a history table is used to indicate which of several buffers is less recently referenced. The less recently referenced buffer is then eligible for overwriting.
U.S. Pat. No. 5,604,866 discloses a system for controlling the transmission of messages of a predetermined size from a CPU to a buffer for an I/O device, which buffer has a capacity for a predetermined number of messages. A counter on the CPU notes the number of messages sent to each buffer and sends messages only when the buffer has the capacity to accept the message.
None of the prior art devices provides a software-based method for determining the location of a defective channel. This invention allows one to determine which channel is defective in the event of a channel failure, whether two or more channels are involved.