1. Field of the Invention
This invention relates to the field of filesystems and, more particularly, to distributed coherent filesystems.
2. Description of the Related Art
Distributed computing systems, such as clusters, may include two or more nodes, which may be employed to perform computing tasks. Generally speaking, a node is a group of circuitry designed to perform one or more computing tasks. A node may include one or more processors, a memory and interface circuitry. Generally speaking, a cluster is a group of two or more nodes that have the capability of exchanging data between the nodes. A particular computing task may be performed upon one node, while other nodes perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among the nodes to decrease the time required to perform the computing task as a whole. Clusters may provide high availability such that if one node fails, the other nodes of the cluster can perform the required tasks. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operations are performed in response to instructions executed by the processor.
A filesystem may be employed by a computer system to organize files and to map those files to storage devices such as disks. Generally speaking, a storage device is a persistent device capable of storing large amounts of data. For example, a storage device may be a magnetic storage device such as a disk device, or an optical storage device such as a compact disc device. Data on storage devices may be organized as data blocks. The filesystem organizes the data blocks into individual files and directories of files. The data within a file may be non-contiguously stored in locations scattered across the storage device. For example, when a file is created, it may be stored in a contiguous data block on the storage device. After the file has been edited, unused gaps within the file may be created. The filesystem may allocate the unused gaps to other files. Accordingly, the original file may now be stored in several non-contiguous data blocks.
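The fragmentation scenario described above can be sketched with a toy allocator. This is only an illustration under stated assumptions: the `BlockDevice` class, its first-fit policy and the file names are hypothetical and are not taken from any particular filesystem.

```python
# Hypothetical toy allocator; names and the first-fit policy are illustrative only.
class BlockDevice:
    def __init__(self, num_blocks):
        # owner[i] is the name of the file holding block i, or None if the block is free
        self.owner = [None] * num_blocks

    def allocate(self, name, count):
        """Give `count` free blocks to `name`, lowest block numbers first."""
        free = [i for i, o in enumerate(self.owner) if o is None]
        if len(free) < count:
            raise OSError("no space left on device")
        for i in free[:count]:
            self.owner[i] = name
        return free[:count]

    def release(self, name, blocks):
        """Return specific blocks of `name` to the free pool."""
        for i in blocks:
            if self.owner[i] != name:
                raise ValueError(f"block {i} does not belong to {name}")
            self.owner[i] = None

    def blocks_of(self, name):
        """List the block numbers currently held by `name`, in ascending order."""
        return [i for i, o in enumerate(self.owner) if o == name]


dev = BlockDevice(8)
dev.allocate("a.txt", 4)        # a.txt starts out in blocks 0-3, contiguous
dev.release("a.txt", [1, 2])    # an edit leaves an unused gap inside a.txt
dev.allocate("b.txt", 2)        # the gap is handed to another file
dev.allocate("a.txt", 2)        # a.txt grows again, past its original extent
layout_a = dev.blocks_of("a.txt")
layout_b = dev.blocks_of("b.txt")
print(layout_a, layout_b)
```

After these steps the first file is no longer stored in one contiguous run, which is the situation the filesystem's block mapping has to track.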
Filesystems include internal data, called meta-data, to manage files. Meta-data may include data that indicates: where each data block of a file is stored, where memory-modified versions of a file are stored, and the permissions and owners of a file. The above uses of meta-data are for illustrative purposes only and are not intended to limit the scope of meta-data. In one conventional filesystem, meta-data includes inode data. Inode data specifies where the data blocks of a file are stored on the storage device, provides a mapping from the filesystem name space to the file, and manages permissions for the file.
Directories are types of files that store data identifying files or subdirectories within the directory. Directory contents, like plain files, are data blocks. The data blocks that comprise a directory store data identifying the location of files or subdirectories within the directory.
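As a rough illustration of how inode-style meta-data and directories relate, the sketch below resolves a path by walking directory entries. The `Inode` fields, the inode numbers and the sample tree are assumptions made for the example; a real filesystem keeps directory entries in data blocks rather than in an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    # Hypothetical, simplified inode: real layouts differ between filesystems.
    owner: str
    mode: int                                     # permission bits, e.g. 0o644
    is_dir: bool = False
    blocks: list = field(default_factory=list)    # data block numbers of the file
    entries: dict = field(default_factory=dict)   # directories only: name -> inode number


# A tiny sample tree: / -> home -> notes.txt
inodes = {
    1: Inode("root", 0o755, is_dir=True, entries={"home": 2}),
    2: Inode("root", 0o755, is_dir=True, entries={"notes.txt": 3}),
    3: Inode("alice", 0o644, blocks=[17, 42, 43]),
}
ROOT = 1


def lookup(path):
    """Resolve an absolute path to its inode by walking directory entries."""
    current = ROOT
    for part in path.strip("/").split("/"):
        if not part:
            continue  # the root path "/" has no components to walk
        node = inodes[current]
        if not node.is_dir:
            raise NotADirectoryError(path)
        if part not in node.entries:
            raise FileNotFoundError(path)
        current = node.entries[part]
    return inodes[current]


found = lookup("/home/notes.txt")
print(found.owner, oct(found.mode), found.blocks)
```

The lookup touches only meta-data; the file's actual data lives in the blocks that the final inode lists.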
Conventional filesystems, such as the Unix filesystem (UFS) or Veritas filesystem (VxFS), may not have a filesystem interface whereby the operating system can flush meta-data to a storage device. For the purpose of clarity, data that comprises data files and subdirectories will be referred to as "actual data" to distinguish it from "meta-data".
In distributed computer systems, filesystems may be implemented such that multiple nodes may access the same files. Generally speaking, a distributed filesystem is a filesystem designed to operate in a distributed computing environment. The distributed filesystem may handle functions such as data coherency. Distributed filesystems address several important features of distributed computing systems. First, because multiple nodes can access the same file, the system has high availability in the event of a node failure. Additionally, performance may be increased by multiple nodes accessing the data in parallel. The performance increase is especially evident in systems, such as Internet web servers, which service mainly read data requests.
Because multiple nodes may access the same data in a distributed filesystem, data coherency must be addressed. For example, a first node may modify a copy of a file stored in its memory. To ensure consistency, an algorithm must cause the first node to flush file data and meta-data to a storage device, prior to a second node accessing that file. Accordingly, the second node may require a way to cause the first node to flush the data to the storage device. In conventional filesystems, such as UFS and VxFS, external interfaces may be used to cause the filesystem to flush actual data to the storage device. Conventional filesystems, however, typically do not have an external interface to cause the filesystem to flush meta-data to the storage device. Accordingly, it is difficult to use conventional filesystems as building blocks of distributed filesystems, and distributed filesystems are typically developed from scratch or as substantially modified versions of conventional filesystems. Unfortunately, it is a major investment of time and money to develop a filesystem from scratch, and even more of an investment to develop a distributed filesystem from scratch. Additionally, developing a distributed filesystem from scratch makes it difficult to leverage evolving filesystem technology and to take advantage of the familiarity and support of existing filesystems. Users of a new filesystem must learn and adapt to an unfamiliar user interface. Alternatively, if a distributed filesystem is designed to accommodate users of different filesystems, the developer must develop and support many different versions of the distributed filesystem.
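For orientation, the kind of external interface for flushing actual data mentioned above can be illustrated from user space. The sketch below assumes a POSIX-style `fsync` as exposed by Python's `os.fsync`; the helper name `write_and_flush` and the file name are hypothetical, and exactly what a given platform and filesystem write out in response to the call may vary.

```python
import os
import tempfile


def write_and_flush(path, payload):
    """Write payload, then ask the OS to push the file's buffered data to storage."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # drain Python's user-space buffer into the OS
        os.fsync(f.fileno())   # request that the OS write the file out to the device


# Single-machine demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "shared.dat")
    write_and_flush(target, b"modified by node 1")
    with open(target, "rb") as f:
        readback = f.read()

print(readback)
```

The call is issued by the process that holds the file open. It does not by itself give a second node a way to make the first node flush, which is the gap the paragraph above describes.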
Another shortcoming of current distributed filesystems is that the filesystems typically only support one coherency algorithm. Different algorithms are advantageous in different systems. For example, one coherency algorithm may work well in systems with very few write operations but not work well in systems with a higher percentage of write operations.
Still another shortcoming of current distributed filesystems is that the granularity of a coherency unit is typically fixed. Generally speaking, a coherency unit is a group of data for which coherency is maintained independent of other data. For example, the granularity of a coherency unit may be specified as a file. Accordingly, a coherency operation may be required if two nodes attempt to access the same file. This granularity may work well in systems with relatively small files that are infrequently written to by different nodes, but it may not work well in systems with large files to which different portions are written by different nodes. In the latter system, a granularity of one or more pages of a file may be advantageous.
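The effect of granularity can be made concrete with a small sketch. Everything here is an assumption for illustration: the 4096-byte page size, the `(filename, offset, length)` access tuples and the function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def coherency_units(filename, offset, length, granularity):
    """Return the set of coherency units touched by one access."""
    if length <= 0:
        raise ValueError("length must be positive")
    if granularity == "file":
        return {(filename,)}
    if granularity == "page":
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        return {(filename, page) for page in range(first, last + 1)}
    raise ValueError(f"unknown granularity: {granularity}")


def needs_coherency_operation(access_a, access_b, granularity):
    """Two accesses conflict when they touch at least one common coherency unit."""
    units_a = coherency_units(*access_a, granularity)
    units_b = coherency_units(*access_b, granularity)
    return bool(units_a & units_b)


# Two nodes write different portions of the same large file.
node1_access = ("big.db", 0, 4096)            # first page
node2_access = ("big.db", 1_048_576, 4096)    # one page at the 1 MiB mark

file_level = needs_coherency_operation(node1_access, node2_access, "file")
page_level = needs_coherency_operation(node1_access, node2_access, "page")
print(file_level, page_level)
```

With file granularity the two writes collide even though they never overlap; with page granularity they do not, which is why a finer unit may suit large files written by several nodes.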
What is desired is a distributed filesystem that uses largely unmodified conventional filesystems as building blocks, such that the distributed filesystem may interface to a conventional filesystem to provide the distributed nature of the filesystem. Additionally, a distributed filesystem in which the coherency algorithm and the granularity of a coherency unit may be selected by a client is desirable.
The problems outlined above are in large part solved by a highly-available cluster coherent filesystem in accordance with the present invention. The cluster filesystem is a distributed filesystem that includes a layer that interfaces between the rest of the operating system and conventional local filesystems operating on each node. A meta-data stub is developed that includes code for flushing meta-data to a storage device. In one embodiment, the code of the meta-data stub is a copy of the internal code for flushing meta-data of the local filesystem. When a coherency operation is performed, the cluster filesystem may cause the local filesystem to flush actual data to a storage device and the meta-data stub to flush meta-data to the storage device. In this manner, a coherent cluster filesystem may be developed from conventional local filesystems.
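The division of labor described above can be sketched in a few lines. This is a conceptual model only, not the implementation of the invention: all class and method names are hypothetical, storage is an in-memory stand-in, and the stub reaches into the local filesystem's cached meta-data to mimic the idea of reusing its internal flush code.

```python
class Storage:
    """In-memory stand-in for the shared storage device."""
    def __init__(self):
        self.data = {}
        self.meta = {}


class LocalFilesystem:
    """Stand-in for an unmodified local filesystem exposing only a data-flush interface."""
    def __init__(self, storage):
        self.storage = storage
        self.cached_data = {}
        self.cached_meta = {}

    def write(self, name, payload):
        # Modifications stay in memory until something flushes them.
        self.cached_data[name] = payload
        self.cached_meta[name] = {"size": len(payload)}

    def flush_data(self, name):
        if name in self.cached_data:
            self.storage.data[name] = self.cached_data[name]


class MetaDataStub:
    """Flushes the local filesystem's meta-data, which the filesystem itself does not expose."""
    def __init__(self, local_fs):
        self.local_fs = local_fs

    def flush_meta(self, name):
        if name in self.local_fs.cached_meta:
            self.local_fs.storage.meta[name] = dict(self.local_fs.cached_meta[name])


class ClusterFilesystemLayer:
    """Sits above the local filesystem and drives both flushes on a coherency operation."""
    def __init__(self, local_fs, stub):
        self.local_fs = local_fs
        self.stub = stub

    def coherency_operation(self, name):
        self.local_fs.flush_data(name)   # actual data via the local filesystem
        self.stub.flush_meta(name)       # meta-data via the stub


storage = Storage()
fs = LocalFilesystem(storage)
layer = ClusterFilesystemLayer(fs, MetaDataStub(fs))

fs.write("report.txt", b"draft 2")
before = dict(storage.data)              # nothing has reached storage yet
layer.coherency_operation("report.txt")
print(before, storage.data, storage.meta)
```

The local filesystem class is left untouched by the layer; only the stub needs knowledge of its internals.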
Developing the cluster filesystem from conventional local filesystems has several distinct advantages. The functions performed by the local filesystem do not need to be replicated, which reduces the amount of code that needs to be developed, tested and supported. Additionally, the cluster filesystem may be easily adapted to a variety of local filesystems. To support a new local filesystem, minimal modifications may be required. For example, the interface between the cluster filesystem and local filesystem may be modified to support the new local filesystem interface, and the internal code of the local filesystem for flushing meta-data may be copied to the meta-data stub. Because the cluster filesystem may be easily adapted to new local filesystems, the cluster filesystem may support a variety of local filesystems. Accordingly, a user that prefers to use one conventional local filesystem rather than adapt to a new cluster filesystem may be supported. For example, a user may prefer the UFS. If a cluster filesystem is developed from scratch, the user interface may emulate the Veritas filesystem interface. The user that prefers the UFS may be required to learn the Veritas filesystem interface. To satisfy the user, a second version of the cluster filesystem that emulates the UFS interface may be developed. This may not satisfy another user that prefers still another filesystem interface, and it requires two versions of the program to be developed, tested and supported. By developing a cluster filesystem from conventional local filesystems, the cluster filesystem may be easily adapted to a variety of local filesystems. For example, when a new version of UFS is available, the cluster filesystem may operate with the new version with little or no modification. In contrast, a cluster filesystem developed from scratch would have to be redesigned to encompass the improvements of the new filesystem.
Further, a cluster filesystem according to the present invention may advantageously benefit from improvements in local filesystems. Still further, by leveraging existing filesystems, the present invention leverages the testing and reliability of the local filesystems. Accordingly, the incidence of bugs within the filesystem may be reduced.
Additionally, the cluster filesystem may support a plurality of coherency algorithms. An administrator may select the coherency algorithm that best fits the operating environment of the computer system, which may increase system performance. Further, an administrator may select the granularity of a coherency unit, which may reduce the number of coherency operations in the system.
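One way to picture administrator-selectable behavior is a small configuration object that names an algorithm and a granularity. The sketch is hypothetical throughout: the text does not name specific coherency algorithms, so "write-invalidate" and "write-update" are used purely as familiar placeholders, and the message strings are invented for the example.

```python
from dataclasses import dataclass


def write_invalidate(writer, sharers):
    """Placeholder algorithm: tell every other sharer to drop its copy."""
    return [f"invalidate:{node}" for node in sharers if node != writer]


def write_update(writer, sharers):
    """Placeholder algorithm: push the new contents to every other sharer."""
    return [f"update:{node}" for node in sharers if node != writer]


# Registry of selectable algorithms (names are illustrative assumptions).
ALGORITHMS = {
    "write-invalidate": write_invalidate,
    "write-update": write_update,
}
GRANULARITIES = ("file", "page")


@dataclass(frozen=True)
class CoherencyConfig:
    algorithm: str = "write-invalidate"
    granularity: str = "file"

    def __post_init__(self):
        # Reject settings the cluster filesystem would not know how to apply.
        if self.algorithm not in ALGORITHMS:
            raise ValueError(f"unknown algorithm: {self.algorithm}")
        if self.granularity not in GRANULARITIES:
            raise ValueError(f"unknown granularity: {self.granularity}")


def on_write(config, writer, sharers):
    """Dispatch a write to whichever algorithm the administrator selected."""
    return ALGORITHMS[config.algorithm](writer, sharers)


cfg = CoherencyConfig(algorithm="write-update", granularity="page")
messages = on_write(cfg, "node1", ["node1", "node2", "node3"])
print(messages)
```

Keeping the algorithms behind a common call signature lets the selection change without touching the code that issues writes.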
Broadly speaking, the present invention contemplates a distributed filesystem configured to operate on a distributed computing system that includes a first node, a second node and a storage device, wherein the distributed filesystem includes a first local filesystem, a second local filesystem, a first meta-data stub and a cluster filesystem layer. The first local filesystem is configured to operate on the first node and the second local filesystem is configured to operate on the second node. The first and second local filesystems are non-distributed filesystems. The first meta-data stub is configured to flush meta-data of the first local filesystem to the storage device. The cluster filesystem layer is configured to interface to the first local filesystem, the second local filesystem and the first meta-data stub and is configured to output a command to the first meta-data stub to flush the meta-data to the storage device.
The present invention further contemplates a distributed computing system and a distributed filesystem. The distributed computing system includes a first node, a second node, a storage device and a data communication link coupled between the first node and the second node. The distributed filesystem includes a first local filesystem, a second local filesystem, a first meta-data stub and a cluster filesystem layer. The first local filesystem is configured to operate on the first node and the second local filesystem is configured to operate on the second node. The first and second local filesystems are non-distributed filesystems. The first meta-data stub is configured to flush meta-data of the first local filesystem to the storage device. The cluster filesystem layer is configured to interface to the first local filesystem, the second local filesystem and the first meta-data stub and is configured to output a command to the first meta-data stub to flush the meta-data to the storage device.
The present invention still further contemplates a method of maintaining coherency in a distributed computing system that includes a first node and a second node operating a cluster filesystem, and a storage device, wherein the cluster filesystem includes a first local filesystem, a first cluster filesystem layer that interfaces to the first local filesystem, a second local filesystem, and a second cluster filesystem layer that interfaces to the first cluster filesystem layer and the second local filesystem, the method comprising: developing a first meta-data stub to flush meta-data of the first node to a storage device; the first node receiving a coherency request from the second node; the first local filesystem of the first node flushing actual data to the storage device; and the first meta-data stub of the first node flushing meta-data to the storage device.
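The runtime steps of the method (request, data flush, meta-data flush) can be traced with a two-node toy model. The model is an assumption-laden sketch: the `Node` class, the shared dictionary standing in for the storage device and the direct method call standing in for the inter-node link are all hypothetical, and the first step of developing the stub happens before runtime and is not modeled.

```python
class Node:
    """Toy node holding unflushed modifications in memory (hypothetical model)."""
    def __init__(self, name, storage):
        self.name = name
        self.storage = storage   # shared dict standing in for the storage device
        self.dirty_data = {}     # actual data not yet on storage
        self.dirty_meta = {}     # meta-data not yet on storage
        self.log = []

    def write(self, filename, payload):
        self.dirty_data[filename] = payload
        self.dirty_meta[filename] = {"size": len(payload)}

    def handle_coherency_request(self, filename, requester):
        """Steps performed by the node that holds the modified copy."""
        self.log.append(f"request from {requester.name}")
        if filename in self.dirty_data:
            self.storage["data"][filename] = self.dirty_data.pop(filename)
            self.log.append("local filesystem flushed actual data")
        if filename in self.dirty_meta:
            self.storage["meta"][filename] = self.dirty_meta.pop(filename)
            self.log.append("meta-data stub flushed meta-data")

    def read(self, filename, holder):
        """Ask the holder to flush first, then read from shared storage."""
        holder.handle_coherency_request(filename, self)
        return self.storage["data"].get(filename)


storage = {"data": {"f": b"old"}, "meta": {"f": {"size": 3}}}
node1 = Node("node1", storage)
node2 = Node("node2", storage)

node1.write("f", b"newer")                 # modified only in node1's memory
stale = storage["data"]["f"]               # storage still holds the old contents
fresh = node2.read("f", holder=node1)      # coherency request precedes the read
print(stale, fresh, node1.log)
```

Reading storage directly before the request returns the old contents, while the read that goes through the coherency request sees the first node's modification.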