1. Field of the Invention
This invention relates to an apparatus, system, and method for identifying faulty communication modules. Specifically, the invention relates to an apparatus, system, and method for identifying faulty communication modules within a data storage system.
2. Description of the Related Art
Typical data storage systems include a client and a server. The client, commonly referred to as the host, transmits data to the server. The server, in turn, stores the data. Various types of adapters may be used for communication between the host and the server. A commonly used adapter is the peripheral component interconnect (PCI) adapter. One representative example of a PCI adapter is a host interface card. Often, the server includes a plurality of host interface cards available for connecting a plurality of hosts to the server. Data transmitted to the server is temporarily stored within a host interface card before the data is written to a permanent storage medium within the server, such as a disk array. In addition, when a host requests data stored on the server, a host interface card retrieves the data from the permanent storage medium and transmits the data to the host.
Occasionally during the data storage and retrieval process, the host interface card identifies data that is corrupt. Corrupt data often means that one of the host interface cards that processed the data has malfunctioned and may continue corrupting other data. It is desirable to identify the malfunctioning host interface card to prevent further data corruption.
However, identifying the malfunctioning host interface card may be difficult, because data is often stored on the server using one host interface card and retrieved using a different host interface card. Conventional data storage systems logically conclude that the host interface card that discovered and reported the corrupt data is the faulty host interface card. As a result, data storage systems remove that host interface card from operation when, in reality, the host interface card that originally processed the data may be the faulty host interface card that corrupted the data. Consequently, a properly functioning host interface card may be removed or disabled while a faulty host interface card remains in operation, corrupting data within the data storage system.
FIG. 1 illustrates a representative example of a conventional data storage system 100 having a faulty host interface card 106a. The system 100 stores data packets sent from a host 102 to a server 104. Typical data packets include data and a header, which identifies the type of data contained within the data packet. The server 104 includes a plurality of host interface cards 106 for communicating with one or more hosts 102. Most of the host interface cards 106 are operable host interface cards 106b-d. Certain host interface cards 106a may, however, be faulty.
The host interface cards 106 are configured to receive, temporarily store, and transmit data packets sent from the host 102 to a symmetric multiprocessor 114. The symmetric multiprocessor 114 includes a data cache 116, non-volatile storage 120, and one or more disk controller cards 118. The symmetric multiprocessor 114 stores each data packet in the cache 116 and optionally the non-volatile storage 120 according to various data storage parameters. In response to data storage parameters, the disk controller 118 writes each data packet to a disk group 122 where the data is permanently stored. The various components of the server 104 communicate by way of a central communication bus 124.
Occasionally, a data packet becomes corrupted while being transmitted from the host 102 to the server 104. The data packet may also become corrupted while being transmitted among the various components of the server 104. Corruption changes the data within the data packet. Generally, data storage systems 100 validate the integrity of data packets to prevent transmitting or storing corrupted data.
The host 102 evaluates the data packet and associates a CRC (cyclic redundancy check) value with the data packet before sending the data packet to the server 104. Typically, a CRC value is four bytes containing a 32-bit binary value. When the host interface card 106 receives the data packet from the host 102, the host interface card 106 performs a conventional CRC data verification to validate the integrity of the data packet received. First, the host interface card 106 evaluates the data packet to calculate a CRC value. Next, the host interface card 106 compares the calculated CRC value with the CRC value associated with the data packet. If the calculated CRC value matches the CRC value associated with the data packet, the host interface card 106 determines that the data packet is valid and continues to process the data packet. If the calculated CRC value does not match the CRC value in the data packet, the host interface card 106 determines that the data packet is corrupt, rejects the data packet, and sends an error to the host 102.
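The CRC data verification described above can be sketched as follows. This is a minimal illustration, not the implementation used by any particular host or host interface card: the function names `attach_crc` and `verify_crc` are hypothetical, the packet layout (payload followed by a four-byte CRC) is assumed, and the CRC-32 variant provided by Python's `zlib` module is assumed because the text specifies only that the value is 32 bits wide.

```python
import zlib


def attach_crc(payload: bytes) -> bytes:
    """Host side (assumed layout): append a 4-byte CRC-32 to the payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def verify_crc(packet: bytes) -> bool:
    """Card side: recalculate the CRC and compare it with the attached value."""
    payload, attached = packet[:-4], int.from_bytes(packet[-4:], "big")
    return (zlib.crc32(payload) & 0xFFFFFFFF) == attached


packet = attach_crc(b"block of host data")
valid = verify_crc(packet)  # intact packet: values match

# Flip one bit of the payload to imitate corruption in transit.
tampered = bytes([packet[0] ^ 0x01]) + packet[1:]
tampered_valid = verify_crc(tampered)  # values differ, packet is rejected
```

In this sketch a mismatch simply returns `False`; in the system described, the host interface card 106 would reject the data packet and send an error to the host 102.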
LRC (longitudinal redundancy check) data verification is another method for validating the integrity of data packets that typically involves less processing overhead than CRC data verification. The LRC value differs from the CRC value in that it is two bytes containing a sixteen-bit binary value, a much smaller size to process and store. Consequently, LRC data verification is desirable to validate data integrity as the data packet is transmitted between components within the server 104.
As part of the LRC data verification, the host interface card 106 evaluates the data packet and generates an LRC value. The host interface card 106 then associates the LRC value with the data packet before transmitting the data packet to the symmetric multiprocessor 114. When the symmetric multiprocessor 114 receives the data packet from the host interface card 106, the symmetric multiprocessor 114 performs a conventional LRC data verification to validate the integrity of the data packet.
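A sixteen-bit LRC of the kind described above might look like the following sketch. The text does not specify how the LRC value is calculated, so the algorithm here (an exclusive-OR of successive 16-bit words, with odd-length payloads zero-padded) is an assumption chosen for illustration, and the names `lrc16`, `attach_lrc`, and `verify_lrc` are hypothetical.

```python
def lrc16(payload: bytes) -> int:
    """Assumed LRC: XOR of all big-endian 16-bit words in the payload."""
    if len(payload) % 2:
        payload += b"\x00"  # zero-pad odd-length payloads (assumption)
    value = 0
    for i in range(0, len(payload), 2):
        value ^= int.from_bytes(payload[i:i + 2], "big")
    return value


def attach_lrc(payload: bytes) -> bytes:
    """Sending component: append the 2-byte LRC to the payload."""
    return payload + lrc16(payload).to_bytes(2, "big")


def verify_lrc(packet: bytes) -> bool:
    """Receiving component: recalculate the LRC and compare."""
    payload, attached = packet[:-2], int.from_bytes(packet[-2:], "big")
    return lrc16(payload) == attached


example = lrc16(b"\x12\x34\x56\x78")  # 0x1234 ^ 0x5678
packet = attach_lrc(b"abc")
intact_ok = verify_lrc(packet)

# A single flipped bit after the LRC was attached changes the recalculated value.
damaged = bytes([packet[0] ^ 0x01]) + packet[1:]
damaged_ok = verify_lrc(damaged)
```

The single pass of XOR operations reflects the lower processing overhead the text attributes to LRC data verification relative to CRC data verification.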
Due to a fault in its memory or processor, a faulty host interface card 106 may corrupt the data packet after the CRC data verification has been completed, but before the LRC data verification has been executed. As a result, the faulty host interface card 106 may send a corrupt data packet to the symmetric multiprocessor 114, which executes LRC data verification by comparing the calculated LRC value with the LRC value associated with the data packet. If the calculated LRC value does not match the LRC value in the data packet, the symmetric multiprocessor 114 determines that the data packet is corrupt and rejects the data packet, creating an error in the data storage process. If the calculated LRC value matches the LRC value in the data packet, the symmetric multiprocessor 114 determines that the data packet is valid and stores the data packet in the cache 116 and, optionally, the non-volatile storage 120.
In the depicted example, the faulty host interface card 106 has corrupted the data packet after the completion of the CRC data verification. Consequently, the faulty host interface card 106 evaluates a corrupt data packet, generates an LRC value, and inserts the LRC value into the corrupt data packet. Furthermore, the symmetric multiprocessor 114 receives the corrupt data packet, evaluates the corrupt data packet, and calculates an LRC value to compare against the LRC value in the corrupt data packet. The LRC data verification does not detect the corrupted data packet in the depicted example, because both the calculated LRC value and the LRC value in the data packet were calculated from the same corrupt data.
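The blind spot described above can be reproduced in a few lines. The sketch below assumes the CRC-32 variant from Python's `zlib` module and a hypothetical XOR-based 16-bit LRC (the text specifies neither algorithm), and it models the fault as a single inverted byte. The point it illustrates is only the ordering: because the LRC value is generated after the corruption, it agrees with the corrupt data, while the CRC value from the host does not.

```python
import zlib


def crc32(payload: bytes) -> int:
    return zlib.crc32(payload) & 0xFFFFFFFF


def lrc16(payload: bytes) -> int:
    """Hypothetical LRC: XOR of big-endian 16-bit words (zero-padded)."""
    if len(payload) % 2:
        payload += b"\x00"
    value = 0
    for i in range(0, len(payload), 2):
        value ^= int.from_bytes(payload[i:i + 2], "big")
    return value


original = b"host data block"
host_crc = crc32(original)  # calculated by the host before sending

# 1. The faulty card checks the CRC on receipt: the packet is still intact.
crc_ok_on_receipt = crc32(original) == host_crc

# 2. The fault corrupts the payload after the CRC check (modelled as one inverted byte).
corrupted = bytes([original[0] ^ 0xFF]) + original[1:]

# 3. The faulty card generates the LRC value from the already-corrupt payload.
card_lrc = lrc16(corrupted)

# 4. The symmetric multiprocessor recalculates the LRC over the same corrupt
#    payload, so the two values match and the corruption goes unnoticed.
lrc_ok_at_smp = lrc16(corrupted) == card_lrc

# 5. Only a later CRC check against the host's original value exposes it.
crc_ok_on_retrieval = crc32(corrupted) == host_crc
```

Every LRC check downstream of step 3 passes for the same reason, which is why the corrupt data packet can reach permanent storage.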
After the data packet is validated, the cache 116 or non-volatile storage 120 sends the data packet to the disk controller 118, which then sends the data packet to the disk group 122, where it is stored indefinitely. The disk group 122 may receive a corrupt data packet and execute LRC data verification as previously explained. The LRC data verification does not detect the corrupt data packet because the calculated LRC value matches the LRC value associated with the data packet.
Subsequently, if the host 102 requests the stored data packet, the disk group 122 sends the data packet into the cache 116, where the symmetric multiprocessor 114 performs LRC data verification. The symmetric multiprocessor 114 evaluates the corrupt data packet, calculates an LRC value, and compares the calculated LRC value with the LRC value associated with the data packet. As previously explained, the two values match, so the LRC data verification does not detect the corrupt data packet.
The symmetric multiprocessor 114 then sends the data packet to a host interface card 106d. Because the data server 104 includes a plurality of host interface cards 106, the symmetric multiprocessor 114 often sends data packets to a different host interface card 106b-d than the host interface card 106a that originally received the data packet from the host 102.
The host interface card 106d executes LRC data verification to validate the integrity of the data packet. The host interface card 106d evaluates the data packet and calculates an LRC value. If the calculated LRC value matches the LRC value associated with the data packet, the host interface card 106d executes CRC data verification.
In the depicted example, the host interface card 106d evaluates the data packet and calculates a CRC value. Since the faulty host interface card 106a corrupted the data packet after CRC data verification had been executed, the functioning host interface card 106d evaluates corrupt data and calculates a CRC value that differs from the original CRC value calculated by the host 102 and associated with the data packet. The host interface card 106d sends an error message to the symmetric multiprocessor 114 indicating that the data packet is corrupt. Since there are no conventional means to identify which of the plurality of host interface cards 106 corrupted the data packet, the symmetric multiprocessor 114 erroneously concludes that the host interface card 106d that sent the error message is the faulty host interface card.
Generally, to prevent future errors, the symmetric multiprocessor 114 disables the host interface card 106d that sent the error message, a practice commonly known as “fencing.” Additionally, the symmetric multiprocessor 114 may take the host interface card 106d that sent the error message off-line. The symmetric multiprocessor 114 might also reset the host interface card 106d in an attempt to fix the host interface card 106d that sent the error message. In addition, the symmetric multiprocessor 114 may send an error message to the host 102, indicating that the original data packet is corrupt.
Consequently, under some circumstances, the conventional data storage system 100 may erroneously disable the properly functioning host interface card 106d that sent the error message. The symmetric multiprocessor 114 in such a case fails to properly identify the faulty host interface card 106a that actually corrupted the data packet. As a result, the faulty host interface card 106a continues operating and may continue to corrupt data.
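The misattribution described above can be made concrete with a small simulation. This is a sketch under stated assumptions, not a model of any real controller: the `Card` class and its `ingest` and `deliver` methods are hypothetical, the fault is modelled as one inverted byte, the CRC-32 variant from Python's `zlib` module is assumed, and the intermediate LRC checks are omitted because, as explained earlier, they pass on the corrupt data.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Card:
    """Hypothetical stand-in for a host interface card 106."""
    name: str
    faulty: bool = False
    fenced: bool = False

    def ingest(self, payload: bytes, crc: int) -> bytes:
        """Verify the CRC on receipt; a faulty card then corrupts the payload."""
        if zlib.crc32(payload) != crc:
            raise ValueError("packet corrupt on receipt")
        if self.faulty:
            payload = bytes([payload[0] ^ 0xFF]) + payload[1:]
        return payload

    def deliver(self, payload: bytes, crc: int) -> bool:
        """Verify the CRC before returning data to the host."""
        return zlib.crc32(payload) == crc


cards = {n: Card(n, faulty=(n == "106a")) for n in ("106a", "106b", "106c", "106d")}

data = b"host data block"
crc = zlib.crc32(data)  # CRC value associated by the host

stored = cards["106a"].ingest(data, crc)  # written through the faulty card

reporter = cards["106d"]  # read back through a different card
if not reporter.deliver(stored, crc):
    reporter.fenced = True  # conventional policy: fence whichever card reports the error

fenced = [c.name for c in cards.values() if c.fenced]
faulty_still_active = [c.name for c in cards.values() if c.faulty and not c.fenced]
```

After the run, `fenced` names the healthy card that reported the error and `faulty_still_active` names the card that caused it, which is the outcome the conventional policy produces when the storing and retrieving cards differ.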
What is needed is a method that correctly identifies the faulty host interface card responsible for corrupting the data packet. Furthermore, a data storage system is needed that identifies the faulty host interface card, takes appropriate measures to fix or disable the faulty host interface card, and permits properly operating host interface cards to remain operational.