An important problem in data storage is providing shared file access for a compute cluster composed of many independent processors connected via a high-speed network. In a number of interesting cases, the compute cluster accesses a single file, and in these cases it is challenging to provide sufficient bandwidth from the entire compute cluster to that single file.
Previous approaches to this problem follow one of two architectures. In one class of solution, implemented by Sistina and PolyServe, for example, bandwidth to a single file is scaled by providing multiple servers that coordinate their access to the logical storage array (LUN) holding the file. These systems use a complex distributed locking scheme to coordinate access to the LUN, specifically coordinating operations such as disk block allocation, assignment of blocks to files, allocation of inode numbers to files, and construction of indirect block trees. These systems are typically inefficient because their locking overhead is very high.
In another class of solution, typified by the PVFS system, data is striped among multiple servers through an additional file system layer built on top of a normal file system. In PVFS, updates to the various strip files in the resulting file system are not closely coordinated, and operations that deal with global file properties, such as the file length, are either implemented very expensively or implemented via approximations that may cause application errors. For example, in PVFS, determining the length of a file requires reading the individual file lengths from all of the strips and taking the largest returned result, an expensive procedure. Similarly, an accurate modification time is important for file systems whose data is exported via the Network File System (NFS) protocol, which uses the file's modification time as a version number. But PVFS, and similar parallel file systems, return the modification time for a file via a procedure similar to the one that returns the file length: they check with all servers and return the largest modification time field. Since the clocks of the different servers differ by at least a few microseconds, it is possible for one write to be performed at the server responsible for one stripe, which happens to have the furthest advanced clock, and for a subsequent write to be performed at another server with an older clock, with the result that the second write does not advance the system-wide file modification time. Having two versions of the file with the same modification time may cause incorrect behavior by protocols like NFS that use modification times as version numbers. Because of these problems, PVFS file systems are unsuitable for export over a network with NFS.
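The modification time problem described above can be illustrated with a minimal sketch. It is not PVFS code: the `StripeServer` class, its fields, the `file_length` and `file_mtime` helpers, and the chosen clock offsets are all hypothetical names and values introduced only for illustration, and each server is assumed to report the logical file length implied by its own strip. The sketch takes the maximum over all servers for both attributes, as the paragraph describes, and shows a second write that leaves the aggregated modification time unchanged.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StripeServer:
    """Hypothetical stripe server with its own, possibly skewed, clock."""
    clock_offset_us: int      # skew relative to a reference clock (microseconds)
    implied_length: int = 0   # logical file length implied by this server's strip
    mtime_us: int = 0         # last modification time, stamped by the local clock

    def write(self, true_time_us: int, implied_length: int) -> None:
        # The server stamps the write with its local (skewed) clock.
        self.implied_length = max(self.implied_length, implied_length)
        self.mtime_us = true_time_us + self.clock_offset_us


def file_length(servers: List[StripeServer]) -> int:
    # One query per server, then take the largest result.
    return max(s.implied_length for s in servers)


def file_mtime(servers: List[StripeServer]) -> int:
    # Same procedure as for the length: ask every server, return the maximum.
    return max(s.mtime_us for s in servers)


def demo() -> Tuple[int, int, int, int]:
    # Server 0 runs 50 microseconds ahead of server 1 (assumed values).
    servers = [StripeServer(clock_offset_us=50), StripeServer(clock_offset_us=0)]

    # First write lands on the server with the furthest advanced clock.
    servers[0].write(true_time_us=1_000, implied_length=4096)
    mtime_1, length_1 = file_mtime(servers), file_length(servers)

    # A later write lands on the server with the older clock.
    servers[1].write(true_time_us=1_010, implied_length=8192)
    mtime_2, length_2 = file_mtime(servers), file_length(servers)

    return mtime_1, mtime_2, length_1, length_2
```

In this sketch the file contents and length change between the two observations, yet the aggregated modification time is the same for both, so a client that treats the modification time as a version number, as NFS does, would not be able to tell the two versions apart.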
This invention differs from the current art by providing a solution that combines the efficient locking of a striped solution like PVFS with the correct and efficient file attribute retrieval required for exporting data with NFS.