1. Technical Field
The present invention relates to an improved data processing system, and in particular, to a method and apparatus for improving the reliability of host data stored on Fibre Channel attached storage subsystems.
2. Description of Related Art
Host data that is written to Fibre Channel or iSCSI attached storage subsystems may pass through many layers of the operating system before it is written to disk. At any point along the data transfer path, the data may be corrupted. As a result, corrupted data may be written to disk without the host's knowledge. The data corruption is not tracked at the time of occurrence and may lead to serious problems. For example, when corrupted data is later read from disk, no indication is provided to the host that the data has been previously corrupted, and the host assumes that the data is valid for use. The host's use of the corrupted data typically results in host errors. Moreover, because the corruption is not detected when it occurs, its source cannot be pinpointed and the underlying problem cannot be corrected.
Although the current art contains a number of solutions for improving the reliability of host data stored on Fibre Channel attached storage subsystems, these solutions all contain significant shortcomings. For example, some host resource managers (e.g., file systems, databases, etc.) compute and maintain checksums for data at the time the data is written to disk. A checksum is an integrity protection measure that is performed by adding up components in a string of data and storing the resulting sum. It may later be verified that the data was not corrupted by performing the same operation on the data and checking the “sum”. After reading previously written data but prior to using the data, the resource manager computes a checksum on the data read from disk and compares it to the checksum computed at the time the data was written to disk. If the data was corrupted at the time of the write, the resource manager detects the corruption through a miscompare of the two checksum values and, consequently, does not use the corrupted data.
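The resource-manager checksum scheme described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular file system or database: the class `ChecksummingResourceManager`, the exception `DataCorruptionError`, the helper `flip_first_bit`, the `write_path` parameter, and the 32-bit additive checksum are all assumptions introduced for the example, and an in-memory dictionary stands in for the disk.

```python
class DataCorruptionError(Exception):
    """Raised when data read from disk fails checksum verification."""


def additive_checksum(data: bytes) -> int:
    """Add up the bytes of the data, truncated to 32 bits (assumed width)."""
    return sum(data) & 0xFFFFFFFF


class ChecksummingResourceManager:
    """Hypothetical resource manager that keeps a checksum per block."""

    def __init__(self):
        self.disk = {}       # stands in for the storage subsystem
        self.checksums = {}  # checksums maintained by the resource manager

    def write(self, block_id, data, write_path=None):
        # The checksum is computed when the write request is issued.
        self.checksums[block_id] = additive_checksum(data)
        # write_path simulates the lower layers, which may corrupt the data.
        self.disk[block_id] = write_path(data) if write_path else data

    def read(self, block_id):
        data = self.disk[block_id]
        # Verification happens only now, when the data is re-accessed.
        if additive_checksum(data) != self.checksums[block_id]:
            raise DataCorruptionError(f"checksum mismatch on block {block_id}")
        return data


def flip_first_bit(data: bytes) -> bytes:
    """Simulated corruption somewhere in the write path."""
    return bytes([data[0] ^ 0x01]) + data[1:]
```

In this sketch a call such as `write(2, b"hello", write_path=flip_first_bit)` completes without any error; the corruption surfaces only when `read(2)` is later called, which is the delayed-detection behavior of this approach.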
Within this solution, a number of major drawbacks exist. First, detection of the corruption occurs at the time the corrupted data is re-accessed from disk. This detection may be well after (days, months, years) the time at which the data was corrupted and at points in processing where it is difficult or impossible for the resource manager or an application on the host using the resource manager (e.g., a file system) to recover from the corruption. Second, this solution is specific to a particular resource manager. For instance, because the application/database layer manages the checksum information itself, storing it either as part of the data or in its headers, the information is not understood by the other layers in the software stack. Third, although this solution detects corruption, it does nothing in the way of identifying the point at which the corruption occurred. The only thing that is known is that the data was corrupted between the time a write request was issued for the data by the resource manager and the time the data was later read by the resource manager. Finally, the solution does not provide end-to-end data verification and has a window of vulnerability in which data corruption may occur and not be detected. For the most part, resource managers store application data. Once provided to the resource manager, application data may be corrupted while it resides with the resource manager, but prior to being written to disk by the resource manager. In this case, the corruption will not be detected and the application will be provided with corrupted data that is read from disk, with no indication given to the application that the data is invalid.
The current art provides another solution through the combined support of a specific host resource manager, in this case a database, and a Fibre Channel attached storage subsystem. The database data written to disk is of a fixed data and block format and contains a checksum value within the data format. Prior to writing a data block to disk, the database computes a checksum for the data contents of the block and records the computed checksum in the block. The data and block format of the database are understood by the storage subsystem, and prior to satisfying a write request for a database block, the storage subsystem computes the checksum for the data contents of the block and compares this to the previously computed checksum. If the data has been corrupted in the write path, the storage subsystem detects this through a miscompare of the two checksum values. If a corruption is detected, the storage subsystem does not write the block to disk, but rather signals an error condition back to the database for this write request.
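The combined database and storage subsystem scheme described above might look like the following sketch. The block layout (a 4-byte big-endian checksum header followed by the data contents), the additive checksum, and the names `format_block` and `StorageSubsystem` are hypothetical choices made for illustration; the actual fixed format used by any given database is not specified here.

```python
import struct

# Assumed fixed block format: 4-byte big-endian checksum, then data contents.
HEADER = struct.Struct(">I")


def checksum(payload: bytes) -> int:
    """Additive checksum over the data contents, truncated to 32 bits."""
    return sum(payload) & 0xFFFFFFFF


def format_block(payload: bytes) -> bytes:
    """Database side: record the computed checksum inside the block."""
    return HEADER.pack(checksum(payload)) + payload


class StorageSubsystem:
    """Hypothetical subsystem that understands the database block format."""

    def __init__(self):
        self.disk = {}

    def write(self, lba: int, block: bytes) -> bool:
        # Parse the block using knowledge of the database-specific format.
        (recorded,) = HEADER.unpack_from(block)
        payload = block[HEADER.size:]
        if checksum(payload) != recorded:
            # Corrupted in the write path: refuse the write, signal an error.
            return False
        self.disk[lba] = block
        return True
```

Note that `StorageSubsystem.write` has to know where the checksum sits in the block and how it was computed, which is the coupling between the storage subsystem and the database-specific format that this approach requires.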
While this solution is an improvement over the previous solution described above in that the data corruption is detected earlier and affords better recoverability, it still has a number of major drawbacks. First, it imposes a fixed data and block format that includes a checksum value. Second, it imposes the requirement that a component outside of the database, namely, the storage subsystem, have intimate knowledge of the database-specific format. Third, this solution, like the previous solution above, also suffers from the problem that it does little in the way of identifying the source of the corruption. It narrows the point of corruption to the operating system and Fibre Channel input/output (I/O) paths involved in writing the data, but these are significant paths, and in the case of the operating system, made up of many components.
A third solution in the current art is provided through an extension of the Fibre Channel protocol to include a cyclic redundancy check (CRC) control value for verifying packets. Similar in concept to a checksum, a CRC value is computed at the host by the Fibre Channel adapter for host data to be transmitted over the Fibre Channel link as part of writing host data to disk. The CRC value is sent along with the host data to the storage subsystem over the Fibre Channel link. On receipt, the storage subsystem computes a CRC value for the received data and compares it against the received CRC. If the data has been corrupted during transmission, this is detected by a miscompare of the CRC values, and the storage subsystem does not write the data to disk and rejects the packet with an error. The major drawback of this solution is that it does not provide end-to-end verification of data and only detects corruption that has occurred in the transmission of data across the Fibre Channel link.
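The link-level CRC check described above, and its limitation, can be illustrated with the following sketch. The functions `adapter_send`, `subsystem_receive`, and `flip_bit` are hypothetical, and Python's `zlib.crc32` is used only as a convenient stand-in; no claim is made that it matches the polynomial or framing used by the protocol extension.

```python
import zlib


def adapter_send(data: bytes):
    """Host adapter: compute a CRC over the data just before transmission."""
    return data, zlib.crc32(data)


def subsystem_receive(data: bytes, crc: int, disk: dict, lba: int) -> bool:
    """Storage subsystem: recompute the CRC and compare before writing."""
    if zlib.crc32(data) != crc:
        return False  # corrupted on the link: reject, do not write
    disk[lba] = data
    return True


def flip_bit(data: bytes) -> bytes:
    """Simulated single-bit corruption of the first byte."""
    return bytes([data[0] ^ 0x01]) + data[1:]
```

If `flip_bit` is applied to the data after `adapter_send` has computed the CRC, `subsystem_receive` rejects the packet. If it is applied before `adapter_send`, for example in an operating system layer above the adapter, the CRC is computed over the already corrupted data and the write is accepted, which is why this check is not end-to-end.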
Therefore, it would be advantageous to have an improved method for increasing the reliability of host data stored on Fibre Channel attached storage subsystems. It would further be advantageous to have an end-to-end solution for data reliability between a host system and a Fibre Channel attached storage device without any restriction on the form, structure, or content of the data transacted.