1. Field
The present method and apparatus are used primarily in the field of networked storage systems. Such systems may, non-exhaustively, comprise distributed file systems, cloud networks, remote storage or other network-addressable storage arrangements. More specifically, the present invention relates to a method of, and apparatus for, end to end data integrity. In particular, the present invention relates to a method of, and apparatus for, end to end data integrity using the T10 protocol.
2. Description of Related Art
Data integrity is a core requirement for a reliable storage system. The ability to prevent and, if necessary, identify and correct data errors and corruptions is essential for operation of storage systems ranging from a simple hard disk drive up to large mainframe storage arrays.
A typical hard disk drive comprises a number of addressable units, known as sectors. A sector is the smallest externally addressable portion of a hard disk drive. Each sector typically comprises 512 bytes of usable data. However, recent developments under the general term “advanced format” sectors enable support of sector sizes up to 4 kilobytes (4096 bytes). When data is written to a hard disk drive, it is usually written as a block of data, which comprises a plurality of contiguous sectors.
A hard disk drive is an electro-mechanical device which may be prone to errors and/or damage. Therefore, it is important to detect and correct errors which occur on the hard disk drive during use. Commonly, hard disk drives set aside a portion of the available storage in each sector for the storage of error correcting codes (ECCs). This data is also known as protection information. The ECC can be used to detect corrupted or damaged data and, in many cases, such errors are recoverable through use of the ECC. However, in many cases, such as enterprise storage architectures, the risk of such errors occurring must be reduced further.
One approach to improve the reliability of a hard disk drive storage system is to employ a redundant array of inexpensive disks (RAID). Indeed, RAID arrays are the primary storage architecture for large, networked computer storage systems.
The RAID architecture was first disclosed in “A Case for Redundant Arrays of Inexpensive Disks (RAID)”, Patterson, Gibson, and Katz (University of California, Berkeley). RAID architecture combines multiple small, inexpensive disk drives into an array of disk drives that yields performance exceeding that of a single large drive.
There are a number of different RAID architectures, designated as RAID-1 through RAID-6. Each architecture offers disk fault-tolerance and offers different trade-offs in terms of features and performance. RAID controllers provide data integrity through redundant data mechanisms, high speed through streamlined algorithms, and accessibility to stored data for users and administrators.
RAID architecture provides data redundancy in two basic forms: mirroring (RAID 1) and parity (RAID 3, 4, 5 and 6). The implementation of mirroring in RAID 1 architectures involves creating, on a secondary disk, an identical image of the data held on a primary disk. RAID 3, 4, 5, or 6 architectures generally utilise three or more disks of identical capacity. In these architectures, two or more of the disks are utilised for reading/writing of data and one or more of the disks store parity information. Data interleaving across the disks is usually in the form of data “striping”, in which the data to be stored is broken down into blocks called “stripe units”. The “stripe units” are then distributed across the disks.
Therefore, should one of the disks in a RAID group fail or become corrupted, the missing data can be recreated from the data on the other disks. The data may be reconstructed through the use of the redundant “stripe units” stored on the remaining disks. However, RAID architectures utilising parity configurations need to generate and write parity information during a write operation. This may reduce the performance of the system.
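By way of illustration only, the reconstruction of a missing stripe unit from parity can be sketched as follows. The sketch assumes a single parity unit formed by byte-wise XOR, as is typical of RAID 3, 4 and 5 (RAID 6 adds a second, differently computed redundancy unit that is not shown). The function name `xor_blocks` and the stripe unit contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data stripe units of identical size (hypothetical contents).
stripe_units = [b"AAAA", b"BBBB", b"CCCC"]

# The parity stripe unit is generated and written during the write operation.
parity = xor_blocks(stripe_units)

# Simulate the loss of the disk holding stripe unit 1: XOR of the surviving
# data units and the parity unit recreates the missing data.
survivors = [stripe_units[0], stripe_units[2], parity]
rebuilt = xor_blocks(survivors)
```

The parity calculation must be repeated whenever any stripe unit in the stripe is written, which is the source of the write overhead noted above.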
For a system with local storage, the American National Standards Institute's (ANSI) T10-DIF (Data Integrity Field) specification format enables data protection. The T10-DIF format specifies data to be written in blocks or sectors of 520 bytes. This is shown schematically in FIG. 7. The 8 additional bytes in the data integrity field provide additional protection information (PI), some of which comprises a checksum that is stored on the storage device together with the data. The data integrity field is checked on every read and/or write of each sector to verify data integrity between system memory and a host bus adapter (HBA). This enables detection and identification of data corruption or errors. T10-DIF is hardware-based, where an I/O controller adds the protection information (PI) that is then verified by the storage device hardware. Therefore, T10-DIF is only suitable for localised hardware because it cannot protect across a network.
ANSI T10-DIF provides three types of data protection: logical block guard (GRD) for comparing the actual data written to disk, a logical block application tag (APP) and a logical block reference tag (REF) to ensure writing to the correct virtual block. The logical block application tag is not reserved for a specific purpose.
In general, the operation of T10-DIF in a local storage system is shown in FIG. 7. A byte stream is generated by a client application. This is then formatted by an Operating System (OS) into sectors of 512 bytes. The I/O controller (or host bus adapter) then appends 8 bytes of PI to each 512 byte sector to form a 520 byte sector. This is then sent via a storage area network (SAN) to the RAID array and eventually to the disk drive, where the data is written as a 520 byte sector. The PI is checked at each of these stages.
A further extension to the T10-DIF format is the T10-DIX (data integrity extension) format, which provides 8 bytes of extension information so that PI can potentially be piped from the client application directly to the storage device.
This process is illustrated in FIG. 8. The same 520 byte sector format is used in T10-DIX as in T10-DIF. However, in this instance, the 8 bytes of PI are generated by the user application or OS along with the 512 byte sector. The 8 byte PI is then checked at every stage in the transfer of data to the storage disk drive.
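A minimal sketch of how 8 bytes of PI might be appended to a 512 byte sector and subsequently checked is given below. It assumes the commonly described field layout of a 2 byte guard tag, a 2 byte application tag and a 4 byte reference tag, stored big-endian, with the guard tag computed as a 16-bit CRC using the generator polynomial 0x8BB7. The function names, the tag values and the use of a block address as the reference tag are hypothetical and are chosen purely for illustration.

```python
import struct

SECTOR_SIZE = 512
T10_DIF_POLY = 0x8BB7  # generator polynomial assumed for the guard CRC


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect(sector: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Append guard (2 bytes), application (2 bytes) and reference (4 bytes) tags."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("sector must be exactly 512 bytes")
    pi = struct.pack(">HHI", crc16_t10dif(sector), app_tag, ref_tag)
    return sector + pi


def verify(protected: bytes, expected_ref_tag: int) -> bool:
    """Re-compute the guard tag and compare the reference tag."""
    sector, pi = protected[:SECTOR_SIZE], protected[SECTOR_SIZE:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref_tag == expected_ref_tag


# Hypothetical sector destined for block address 42.
sector = bytes(range(256)) * 2
protected = protect(sector, app_tag=0, ref_tag=42)  # 520 bytes in total

# A single flipped bit in the data is caught by the guard tag.
damaged = bytearray(protected)
damaged[100] ^= 0x01
```

Here `verify(protected, 42)` succeeds, whereas `verify(bytes(damaged), 42)` and `verify(protected, 43)` both fail: the former because the guard tag no longer matches the data, the latter because the sector is not the one expected at that address.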
Data protection measures such as RAID, T10-DIF and T10-DIX are useful to prevent data corruption and errors occurring locally on a storage system. However, storage solutions are now generally accessible across networks. For example, distributed file systems are now common. A distributed file system (or network file system) is a file system that allows access to files from multiple hosts (or clients) sharing via a network such as an Ethernet network or the Internet. This makes it possible for multiple users on multiple machines to share files and storage resources. The most common arrangement is a client-server architecture in which the server provides a service to requesting hosts.
A commonly used distributed file system is Lustre™. Lustre is a parallel distributed file system, generally used for large scale cluster computing. Lustre file systems are scalable and are able to support many thousands of clients and multiple servers. A Lustre file system generally comprises three units: a metadata server (MDS), one or more object storage servers (OSSs) and one or more clients which access the OSSs.
The MDS generally comprises a single metadata target (MDT) per file system that is operable to store metadata such as filenames, directories, access permissions, and file layout. The MDT data is generally stored in a single local disk file system.
The OSSs store file data on one or more object storage targets (OSTs). Each OST manages a single local disk file system. Lustre presents all clients with a unified namespace for all of the files and data in the file system and allows read and write access to the files in the file system.
One of the challenges of a distributed file system such as Lustre is to provide efficient end-to-end data integrity. Therefore, mechanisms are required to ensure that data sent by a client node (e.g. client computer or application) is stored on a respective OST correctly. In other words, it is desirable for data written by a client application to be verified as correctly saved on a storage device. This is not possible using techniques such as RAID, T10-DIF and T10-DIX. This is because these techniques act locally on a server whereas in a network file system data corruption may occur at other locations, for example on the client or across the network before the data arrive at the server.
In order to address this issue, many Lustre systems use an I/O checksum which provides “over-the-wire” verification. The Lustre checksum uses a CRC-32 algorithm to detect single bit errors and swapped and/or missing bytes. A Lustre network checksum is generally provided per 1 MB client remote procedure call (RPC). However, other checksum algorithms may also be used; for example, Adler32 or CRC-32c.
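The principle of a per-RPC network checksum can be sketched as follows. This is an illustration under stated assumptions and not the Lustre implementation: the function names are hypothetical, the payload is synthetic, and the CRC-32 and Adler32 routines are those of the Python standard `zlib` module.

```python
import zlib

RPC_SIZE = 1024 * 1024  # 1 MB of data per RPC, as described above


def rpc_checksums(payload: bytes, algorithm=zlib.crc32):
    """Compute one checksum for each 1 MB RPC-sized chunk of the payload."""
    return [
        algorithm(payload[offset:offset + RPC_SIZE])
        for offset in range(0, len(payload), RPC_SIZE)
    ]


def corrupted_rpcs(payload: bytes, expected, algorithm=zlib.crc32):
    """Return the indices of RPCs whose received checksum does not match."""
    actual = rpc_checksums(payload, algorithm)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Client side: 3 MB of synthetic data, checksummed before transmission.
data = bytes(range(256)) * (3 * RPC_SIZE // 256)
sent = rpc_checksums(data)

# Single bit error introduced in the second RPC while in transit.
flipped = bytearray(data)
flipped[RPC_SIZE + 10] ^= 0x01
bit_flip_hits = corrupted_rpcs(bytes(flipped), sent)

# Two adjacent bytes swapped in the third RPC while in transit.
swapped = bytearray(data)
i = 2 * RPC_SIZE + 100
swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
swap_hits = corrupted_rpcs(bytes(swapped), sent)
```

Passing `algorithm=zlib.adler32` to both functions selects Adler32 instead. In either case the receiving server must re-compute a checksum over every byte of every RPC, so the work performed on the server grows with the volume of client traffic.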
However, whilst the Lustre checksum is able to provide “over-the-wire” verification of data, it is only able to provide protection over a network. It cannot provide protection locally on a server. In addition, the Lustre checksum algorithm requires a significant amount of computational resources on the server side to execute the necessary checksums. This is particularly an issue since the number of checksums to be calculated scales with the number of clients, and there are generally considerably more clients (and therefore considerably more RPCs to execute) than there are servers. Consequently, storage system performance can be affected if the OSS and MDS server hardware is not powerful enough to handle the I/O, file system administration and checksum services concurrently.
Therefore, in summary, the existing T10-DIF format is operable to provide data integrity only locally. Conversely, the network checksum can only provide data integrity over the wire. Neither of these approaches can provide data integrity on the client itself, let alone complete end to end data integrity. Furthermore, T10-DIF and network checksums are not linked together in any way, increasing complexity and computational loads on the servers.
Consequently, to date, known storage systems suffer from the technical problem that end to end data integrity cannot be achieved reliably without placing high demand on server resources.