Currently, high-performance and grid computing is performed on parallel processors. Applications include seismic data processing for oil exploration, climate modeling and weather forecasting, and other kinds of simulation programs.
A drawback of high-performance and grid computing has been the rapid obsolescence of high-performance parallel processor systems. In the past, supercomputer development was a driving force in advancing computer technology. Recently, however, major advances in processing power, storage, and network technology have been driven by world-wide demand for personal computers and high-speed Internet service. For example, inexpensive disk drives originally designed for personal computers are assembled into redundant storage arrays (RAID). Processor chips originally designed for high-end personal computers are assembled onto multi-processor computer blades offered as commodity items. The multi-processor computer blades are assembled into parallel processor systems or programmed to perform specific functions such as storage access servers and network switches.
A problem encountered when assembling a high-performance parallel processor from commodity processors is the scalability of asynchronous access to shared memory and storage. In particular, it is desirable for the processors to access shared data at a higher level of abstraction than disk blocks in order to use conventional techniques for storage allocation and maintenance of coherency. In this fashion, the application programming for the parallel processor system can be independent of the hardware implementation. This provides not only for software portability but also for incremental upgrading of the shared memory and storage.
Presently, storage vendors are attempting to extend standardized network file system access protocols to eliminate a single-server bottleneck to file access and achieve scalable sharing for file systems and individual files. Perhaps the most advanced extension is the parallel Network File System (pNFS) proposal specified in the Internet Draft by Welch et al. entitled “pNFS Operations draft-welch-pnfs-ops-02.txt” dated Jun. 9, 2005. The key feature of the proposal is said to be the ability for clients to perform read and write operations that go directly from the client to individual storage elements without funneling all such accesses through a single file server. The file server must still coordinate the client I/O so that the file system retains its integrity. The extension adds operations that query and manage layout information that allows parallel I/O between clients and storage system elements. The layouts are managed in a manner similar to delegations in that they have leases and can be recalled by the server, but layout information is independent of delegations. pNFS provides several ways to access data in parallel: using files, block devices, or object storage devices.
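The lease and recall behavior of layouts can be illustrated with a minimal sketch. The class name, method names, and lease duration below are hypothetical and are not taken from the pNFS draft; the sketch assumes only that each layout carries a lease that lapses after a fixed interval and that the server may recall all layouts on a file.

```python
class LayoutManager:
    """Hypothetical server-side table of layouts granted to clients."""

    def __init__(self, lease_seconds):
        self._lease_seconds = lease_seconds
        self._expiry = {}  # (client, file name) -> time at which the lease lapses

    def grant(self, client, name, now):
        # Granting (or re-granting) a layout starts a fresh lease.
        self._expiry[(client, name)] = now + self._lease_seconds

    def is_valid(self, client, name, now):
        # A layout is usable only while its lease has not lapsed.
        expiry = self._expiry.get((client, name))
        return expiry is not None and now < expiry

    def recall(self, name):
        # Revoke every layout on the file and report which clients held one.
        holders = sorted(c for (c, n) in self._expiry if n == name)
        for client in holders:
            del self._expiry[(client, name)]
        return holders


manager = LayoutManager(lease_seconds=30)
manager.grant("node-a", "model.dat", now=0)
manager.grant("node-b", "model.dat", now=10)

valid_before_expiry = manager.is_valid("node-a", "model.dat", now=20)
valid_after_expiry = manager.is_valid("node-a", "model.dat", now=30)
recalled = manager.recall("model.dat")
valid_after_recall = manager.is_valid("node-b", "model.dat", now=15)
```

Time is passed in explicitly so that the example does not depend on a system clock. A client whose layout has lapsed or been recalled would have to obtain a new layout from the server before performing further direct I/O.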
The pNFS proposal is based on a file mapping protocol (FMP) developed by EMC Corporation of Hopkinton, Mass. Various aspects of FMP are described in Xu et al. U.S. Pat. No. 6,324,581 issued Nov. 27, 2001; Vahalia et al. U.S. Pat. No. 6,389,420 issued May 14, 2002; Jiang et al. U.S. Pat. No. 6,453,354 issued Sep. 17, 2002; and Vahalia et al. U.S. Pat. No. 6,973,455 issued Dec. 6, 2005; all of which are incorporated herein by reference. More particularly, the pNFS proposal applies some aspects of FMP to NFS, as suggested in a position paper by Stephen Fridella, Xiaoye Jiang, and David Black, entitled “Elements of a Scalable Network File System Protocol,” presented at the NEPS workshop, University of Michigan Center for Information Technology Integration, Dec. 4, 2003.
As suggested by Fridella et al., a successful scalable network file system protocol should extend existing, proven network file system protocols, provide clients with a meta-data management interface for each file, and allow for an extensible notion of a block address space (or logical volume). The meta-data management interface should include an operation to query a file's block mapping in order to allow a client to read data directly from the storage fabric, an operation to allocate new blocks and (provisionally) add them to the file's block mapping in order to allow a client to write data directly to the storage fabric, an operation to commit or release the provisionally allocated blocks, mechanisms to maintain the consistency of file data when multiple clients are sharing access to a single file, and a mechanism (such as a lease mechanism) that enables consistent recovery from client or server failure. In accordance with the FMP protocol, the client sends requests to a metadata server. These requests include a “get_mapping” request, an “allocate_block” request, and a “commit_block” request. In addition, a client may send an initial “volume-discovery” request in order to discover shared storage objects or files for a specified namespace or file system.
The following is an example of using the FMP requests for reading data. In a compute node, a local application sends a file read request to an FMP client. The FMP client in the compute node sends a “get_mapping” request to a metadata server. The metadata server reads the file mapping, and returns a reply to the FMP client. The FMP client uses the file mapping to read file data from storage, and then the FMP client returns the file data to the local application. For subsequent reads, the FMP client may use cached data or cached mappings.
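The read sequence described above can be sketched as follows. This is a simplified, in-memory illustration and not the FMP wire protocol: the class names, the Extent structure, and the dictionary standing in for shared storage are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Maps one logical file block to a block address on shared storage."""
    file_block: int
    storage_block: int


class MetadataServer:
    """Hypothetical stand-in for the metadata server; it holds mappings only."""

    def __init__(self, mappings):
        self._mappings = mappings  # file name -> list of Extent
        self.requests_served = 0

    def get_mapping(self, name):
        self.requests_served += 1
        return list(self._mappings[name])


class FmpClient:
    """Hypothetical FMP client running in a compute node."""

    def __init__(self, server, storage):
        self._server = server
        self._storage = storage  # storage block address -> bytes
        self._mapping_cache = {}

    def read(self, name):
        # Ask the metadata server only when no mapping is cached.
        if name not in self._mapping_cache:
            self._mapping_cache[name] = self._server.get_mapping(name)
        extents = sorted(self._mapping_cache[name], key=lambda e: e.file_block)
        # Read the data directly from storage, bypassing the metadata server.
        return b"".join(self._storage[e.storage_block] for e in extents)


storage = {7: b"seis", 3: b"mic!"}
server = MetadataServer({"trace.dat": [Extent(0, 7), Extent(1, 3)]})
client = FmpClient(server, storage)

first = client.read("trace.dat")
second = client.read("trace.dat")  # served using the cached mapping
```

In this sketch the second read reuses the cached mapping, so the metadata server receives only one request while the file data itself is fetched from storage both times.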
The following is an example of using the FMP requests for writing data. In a compute node, a local application sends a file write request to the FMP client. The FMP client in the compute node sends an “allocate_block” request to the metadata server. The metadata server pre-allocates storage blocks, and returns a reply to the FMP client. The FMP client returns a reply to the local application, and writes the cached file data to the pre-allocated blocks in storage. Then the FMP client sends a “commit_block” request to the metadata server. The metadata server responds by adding the blocks to the file mapping. For subsequent writes, the FMP client may use cached pre-allocated blocks.
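The write sequence described above can be sketched in the same simplified way. The class names, the four-byte block size, and the free-block list are hypothetical. The sketch also performs the storage write synchronously, whereas the description has the client reply to the application before the cached data is written.

```python
class MetadataServer:
    """Hypothetical metadata server that tracks provisional and committed blocks."""

    def __init__(self, free_blocks):
        self._free = list(free_blocks)
        self._provisional = {}  # file name -> blocks allocated but not committed
        self.mappings = {}      # file name -> committed storage blocks, in order

    def allocate_block(self, name, count):
        # Pre-allocate blocks without yet adding them to the file mapping.
        if count > len(self._free):
            raise RuntimeError("out of storage blocks")
        blocks = [self._free.pop(0) for _ in range(count)]
        self._provisional.setdefault(name, []).extend(blocks)
        return blocks

    def commit_block(self, name, blocks):
        # Move provisionally allocated blocks into the file mapping.
        pending = self._provisional.get(name, [])
        for block in blocks:
            pending.remove(block)  # ValueError if the block was never allocated
            self.mappings.setdefault(name, []).append(block)


class FmpClient:
    """Hypothetical FMP client running in a compute node."""

    def __init__(self, server, storage, block_size=4):
        self._server = server
        self._storage = storage  # storage block address -> bytes
        self._block_size = block_size

    def write(self, name, data):
        size = self._block_size
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        blocks = self._server.allocate_block(name, len(chunks))
        # Write the data directly to the pre-allocated blocks in storage.
        for block, chunk in zip(blocks, chunks):
            self._storage[block] = chunk
        # Only after the data is in place are the blocks made part of the file.
        self._server.commit_block(name, blocks)
        return len(data)


storage = {}
server = MetadataServer(free_blocks=[10, 11, 12])
client = FmpClient(server, storage)
written = client.write("out.dat", b"weather")
```

Committing only after the data has reached storage means that a reader who obtains the file mapping does not see blocks whose contents have not yet been written.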