Advances in communication technology have allowed large numbers of machines to be aggregated into computing clusters of great processing power and storage capacity that can be used to solve much larger problems than a single machine could. Because clusters are composed of independent and effectively redundant computers, they have a potential for fault tolerance. This makes them suitable for other classes of problems in which reliability is paramount. As a result, there has been great interest in clustering technology in the past several years.
Cluster file systems found in the art include IBM's General Parallel File System (GPFS). GPFS is a parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters, that provides, as closely as possible, the behavior of a general-purpose POSIX file system running on a single machine.
One drawback of clusters is that programs must be partitioned to run on multiple machines. It can be difficult for these partitioned programs to cooperate or share resources. Perhaps one of the most important resources is the file system. In the absence of a cluster file system, individual components of a partitioned program share cluster storage in an ad-hoc manner. This can complicate programming, limit performance, and compromise reliability.
Some cluster file systems allow client nodes direct access to metadata, such as directories and file attributes, stored on data servers alongside the file data (distributed metadata), and use a distributed locking protocol to synchronize updates to these metadata. Other systems, such as SAN-FS, Lustre and P-NFS, use one or more dedicated metadata server nodes to handle metadata.
Traditional supercomputing applications, when run on a cluster, require parallel access from multiple nodes within a file shared across the cluster. Other applications, including scalable file and web servers and large digital libraries, are often characterized by interfile parallel access. In the latter class of applications, data in individual files is not necessarily accessed in parallel. But since the files reside in common directories and allocate space on the same disks, file system data structures (metadata) are still accessed in parallel. In large computing systems, even administrative actions, such as adding or removing disks from a file system or rebalancing files across disks, can involve a great amount of work.
The advantage of a cluster file system over a traditional file server is that by distributing data over many data servers, higher aggregate data throughput can be provided. Cluster file systems that use a dedicated metadata server often provide little advantage when it comes to metadata operations such as file creates and deletes, since these operations are usually handled by a single metadata server.
By allowing all client nodes to create or delete files in parallel, cluster file systems with distributed metadata exploit parallelism to achieve higher metadata throughput. However, whenever two nodes create or delete files in the same directory, these updates must be properly synchronized to preserve file system consistency and to provide correct file system semantics. This limits parallelism and negates the advantage of distributed metadata when many nodes are updating the same directory. It is not uncommon for a parallel application to have each node create one or more working files in the same directory when the job starts up. The resulting lock conflicts can serialize all of these updates and require synchronous I/Os to commit and flush each update back to disk before the next node can lock the block. Due to these additional synchronous I/Os, a set of create operations from multiple nodes takes longer to complete than a single node creating the same number of files.
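The cost of this serialization can be illustrated with a toy timing model. The sketch below is not taken from any particular file system. The function names, the parameters and the sample timings are all hypothetical. It further assumes that a single node holding the directory block lock can batch its updates in cache and flush once, whereas in the worst case every create from a different node pays a lock handoff plus a synchronous flush.

```python
def single_node_create_time(n_files, t_update_us, t_flush_us):
    """Estimated time for one node to create n_files in one directory.

    Assumption: the node keeps the directory block lock for the whole
    run, so updates accumulate in its cache and are flushed once.
    """
    return n_files * t_update_us + t_flush_us


def multi_node_create_time(n_files, t_update_us, t_flush_us, t_handoff_us):
    """Estimated worst-case time when every create comes from a different node.

    Assumption: each create must acquire the directory block lock from
    the previous holder (t_handoff_us), apply its update, and then
    synchronously flush the block before the next node can lock it.
    """
    return n_files * (t_handoff_us + t_update_us + t_flush_us)


# Hypothetical timings in microseconds, chosen only for illustration.
N_FILES = 100
T_UPDATE_US = 100
T_FLUSH_US = 5000
T_HANDOFF_US = 1000

single = single_node_create_time(N_FILES, T_UPDATE_US, T_FLUSH_US)
multi = multi_node_create_time(N_FILES, T_UPDATE_US, T_FLUSH_US, T_HANDOFF_US)
print(f"single node: {single} us, many nodes: {multi} us")
```

Under these assumed numbers the per-create flush dominates the multi-node total, which is the effect described above: the added nodes contribute synchronous I/Os to the directory update rather than parallelism.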
There exists a need to overcome the problems discussed above, and, more particularly, to avoid conflicts on directory blocks while still allowing the bulk of a file create or delete operation to be performed independently and in parallel by all of the nodes in a cluster file system.