The present invention relates generally to storage systems and, more particularly, to a directory-level referral method for a parallel network file system with multiple metadata servers.
Recent technologies in distributed file systems, such as the parallel network file system (pNFS) and the like, enable an asymmetric system architecture, which consists of a plurality of data servers and a dedicated metadata server. In such a system, file contents are typically stored in the data servers, and metadata (e.g., the file system namespace tree structure and the location information of file contents) are stored in the metadata server. Clients first consult the metadata server for the location information of file contents, and then access the file contents directly from the data servers. By separating metadata access from data access, the system is able to provide very high I/O (Input/Output) throughput to the clients. One of the major use cases for such a system is high-performance computing (HPC) applications.
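The two-step access pattern described above can be illustrated with a minimal sketch. It is not pNFS protocol code: the `DataServer` and `MetadataServer` classes, the `get_layout` method, and the chunk layout format are hypothetical in-memory stand-ins, assumed here only to show that the metadata server returns location information while the file contents are read directly from the data servers.

```python
from dataclasses import dataclass, field


@dataclass
class DataServer:
    """Hypothetical stand-in for a data server: stores file contents only."""
    chunks: dict = field(default_factory=dict)  # (path, chunk_index) -> bytes


@dataclass
class MetadataServer:
    """Hypothetical stand-in for the metadata server: stores locations only."""
    layouts: dict = field(default_factory=dict)  # path -> [(server_id, chunk_index)]

    def get_layout(self, path):
        # Return where each chunk of the file lives; no file data is returned.
        return self.layouts[path]


def read_file(path, mds, data_servers):
    # Step 1: metadata access, one request to the metadata server.
    layout = mds.get_layout(path)
    # Step 2: data access, directly against the data servers named in the layout.
    return b"".join(data_servers[sid].chunks[(path, idx)] for sid, idx in layout)


# Assumed example: one file striped across two data servers.
data_servers = {0: DataServer(), 1: DataServer()}
data_servers[0].chunks[("/a.dat", 0)] = b"hello "
data_servers[1].chunks[("/a.dat", 1)] = b"world"
mds = MetadataServer({"/a.dat": [(0, 0), (1, 1)]})

result = read_file("/a.dat", mds, data_servers)
```

In this sketch the metadata server never touches file contents, which is the separation that allows data throughput to scale with the number of data servers.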
Although metadata are relatively small in size compared to file contents, metadata operations may make up as much as half of all file system operations, according to some studies. Effective metadata management is therefore critically important to overall system performance. Modern HPC applications can use hundreds of thousands of CPU cores simultaneously for a single computation task. Each CPU core may steadily create and access files for various purposes, such as checkpoint files for failure recovery and intermediate computation results for post-processing (e.g., visualization and analysis), resulting in a tremendous volume of metadata access. A single metadata server is not sufficient to handle such a metadata access workload. Transparently distributing the workload across multiple metadata servers while presenting a single namespace to clients therefore poses an important challenge for system design. Traditional namespace virtualization methods fall into two categories, namely, server-only-virtualization and client-server-cooperation.
Server-only-virtualization methods can be further divided into two sub-categories, namely, synchronization and redirection. In a synchronization method (U.S. Pat. No. 7,987,161), the entire namespace is duplicated to multiple metadata servers. Clients can access the namespace from any metadata server. Any update to the namespace is synchronized to all the metadata servers. A synchronization method has limited scalability due to the high overhead of namespace synchronization. In a redirection method (U.S. Pat. No. 7,509,645), the metadata servers maintain information about how the namespace is distributed. Once a client establishes a connection with a metadata server, the client always accesses the entire namespace through that same metadata server (called the local server). When the client needs to access a namespace portion that is not stored in the local server, the local server redirects the access to another metadata server (called the remote server) where the namespace portion is located. Once the local server receives the reply from the remote server, it sends the reply to the client. A redirection method has low overall system performance because each such access incurs the redirection overhead.
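The redirection overhead described above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the method of the cited patent: the `MetadataServer` class, its `placement` map, and the `requests_handled` counter are hypothetical names introduced to show that an access to a remotely stored namespace portion is handled by two servers before the client receives a reply.

```python
class MetadataServer:
    """Hypothetical metadata server that forwards requests it cannot serve."""

    def __init__(self, name, owned, placement):
        self.name = name
        self.owned = owned          # path -> metadata stored on this server
        self.placement = placement  # path -> name of the server that stores it
        self.peers = {}             # server name -> MetadataServer
        self.requests_handled = 0   # counts requests this server processed

    def lookup(self, path):
        self.requests_handled += 1
        if path in self.owned:
            return self.owned[path]
        # Not stored locally: forward to the remote server, then relay its reply.
        remote = self.peers[self.placement[path]]
        return remote.lookup(path)


# Assumed two-server deployment: /home is stored on mds1, /proj on mds2.
placement = {"/home": "mds1", "/proj": "mds2"}
mds1 = MetadataServer("mds1", {"/home": "home-metadata"}, placement)
mds2 = MetadataServer("mds2", {"/proj": "proj-metadata"}, placement)
mds1.peers = {"mds2": mds2}
mds2.peers = {"mds1": mds1}

# The client is connected to mds1 (its local server) and asks for /proj.
reply = mds1.lookup("/proj")
total_server_requests = mds1.requests_handled + mds2.requests_handled
```

Here a single client request results in two server-side requests, and the reply passes through the local server on its way back to the client.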
Client-server-cooperation methods can also be further divided into two sub-categories, namely, distribution-aware and referral-based. In a distribution-aware method (U.S. Patent Application Publication No. 2011/0153606A1), each client has a distribution-aware module which maintains information about how the namespace is distributed, and is able to access a namespace portion directly from the metadata server where the namespace portion is located. However, a distribution-aware method requires a proprietary client, which limits its use cases. In a referral-based method (U.S. Pat. No. 7,389,298), a client can seamlessly navigate a namespace across pre-created referral points with a single network mount. However, the referral points can only be created on exported file systems, in advance, by a system administrator. Workload balancing is coarse-grained and requires manual reconfiguration by the system administrator to relocate referral points. Hence, there is a need for a new namespace virtualization method that overcomes the aforementioned shortcomings.
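For comparison with redirection, the referral-based navigation described above can be sketched as follows. This is a simplified illustration, not the method of the cited patent or the present invention: the `MetadataServer` class, the `referral_points` table, the `client_lookup` helper, and the hop limit are all hypothetical. The sketch assumes that a server answers a request under a pre-created referral point with the name of the target server, and that the client then re-issues the request to that server itself.

```python
class MetadataServer:
    """Hypothetical metadata server with statically configured referral points."""

    def __init__(self, entries, referral_points):
        self.entries = entries                  # path -> metadata stored here
        self.referral_points = referral_points  # path prefix -> target server name

    def lookup(self, path):
        # A path at or below a referral point is answered with a referral.
        for prefix, target in self.referral_points.items():
            if path == prefix or path.startswith(prefix + "/"):
                return ("referral", target)
        return ("ok", self.entries[path])


def client_lookup(path, servers, start, max_hops=8):
    """Follow referrals on the client side; returns (metadata, servers visited)."""
    current = start
    visited = [current]
    for _ in range(max_hops):
        status, value = servers[current].lookup(path)
        if status == "ok":
            return value, visited
        current = value  # the client contacts the referred server directly
        visited.append(current)
    raise RuntimeError("too many referrals")


# Assumed configuration: an administrator pre-created a referral at /proj.
servers = {
    "mds1": MetadataServer({"/home": "home-metadata"}, {"/proj": "mds2"}),
    "mds2": MetadataServer({"/proj/run1": "run1-metadata"}, {}),
}

metadata, visited = client_lookup("/proj/run1", servers, "mds1")
```

Because the referral table in this sketch is fixed when the servers are configured, moving a namespace portion to a different server requires editing that table by hand, which mirrors the manual reconfiguration noted above.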