The present invention relates to data caching in geographically distributed file systems.
Distributed wide-area file-caching products allow enterprises to make data available across geographically dispersed locations. These systems usually comprise a central “home” file system cluster and a set of secondary, or “cache”, file system clusters. The home cluster contains the master copy of all files and directories, and the cache clusters hold cached copies of files from the home cluster. A file is cached when an operation is performed on it in a cache cluster.
In the cache cluster, cache misses on read operations (e.g., read( ) or fstat( )) must be handled synchronously, i.e., the caller is blocked until sufficient data has been fetched from the home site. However, modifications to the file system namespace or file content can be pushed from the cache site to the home site in either a synchronous (write-through) or asynchronous (write-back) fashion.
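The difference between the two push modes can be illustrated with a minimal in-memory sketch. The names `HomeSite`, `CacheSite`, `write_back`, and `flush` are hypothetical and do not correspond to any product's API. As a simplifying assumption, the sketch keeps the written data in the local cache in both modes, so that only the timing of the update to the home site differs.

```python
from collections import deque


class HomeSite:
    """Hypothetical stand-in for the home cluster (holds the master copies)."""

    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


class CacheSite:
    """Hypothetical cache cluster supporting write-through or write-back."""

    def __init__(self, home, write_back=False):
        self.home = home
        self.write_back = write_back
        self.cache = {}
        self.pending = deque()  # writes not yet pushed to the home site

    def read(self, path):
        # A read miss is handled synchronously: the caller waits for the fetch.
        if path not in self.cache:
            self.cache[path] = self.home.read(path)
        return self.cache[path]

    def write(self, path, data):
        # Assumption of this sketch: the local copy is updated in both modes.
        self.cache[path] = data
        if self.write_back:
            # Asynchronous: queue the update and return immediately.
            self.pending.append((path, data))
        else:
            # Synchronous: the home site is updated before write() returns.
            self.home.write(path, data)

    def flush(self):
        # Push queued write-back updates to the home site in order.
        while self.pending:
            path, data = self.pending.popleft()
            self.home.write(path, data)
```

With `write_back=False`, the home site reflects a write as soon as `write()` returns. With `write_back=True`, the home site continues to hold the old data until `flush()` runs.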
Existing state-of-the-art write-through file-caching products invalidate the cached copy of a file when it is modified on the cache site. When a modification occurs, the cache site's file system discards the locally cached copy of the file and performs the write operation directly on the home site. A subsequent read operation issued by the application on the cache site then causes the file to be re-cached from the home site. Therefore, even though the write committed successfully from the application's perspective, the file may become unavailable if the home site fails or if the cache site becomes disconnected from the home site, since the file is not re-cached locally until the next read. Furthermore, a read operation that follows a write incurs write amplification due to the re-caching of the file.
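The availability gap described above can be reproduced with a small sketch of the invalidate-on-write behavior. All names here (`HomeCluster`, `InvalidatingCacheSite`, `HomeUnreachable`, the `reachable` flag, and the `fetches` counter) are hypothetical and serve only to model the behavior; they are not taken from any existing product.

```python
class HomeUnreachable(Exception):
    """Raised when the (hypothetical) home cluster cannot be contacted."""


class HomeCluster:
    """Hypothetical home cluster with a switch that simulates an outage."""

    def __init__(self):
        self.files = {}
        self.reachable = True

    def _check(self):
        if not self.reachable:
            raise HomeUnreachable("home site failed or link is down")

    def read(self, path):
        self._check()
        return self.files[path]

    def write(self, path, data):
        self._check()
        self.files[path] = data


class InvalidatingCacheSite:
    """Models a write-through cache that invalidates on write."""

    def __init__(self, home):
        self.home = home
        self.cached = {}
        self.fetches = 0  # number of times a file was (re-)cached from home

    def read(self, path):
        if path not in self.cached:
            # Miss: fetch from home and store the data locally again.
            self.cached[path] = self.home.read(path)
            self.fetches += 1
        return self.cached[path]

    def write(self, path, data):
        # Discard the local copy, then write directly on the home site.
        self.cached.pop(path, None)
        self.home.write(path, data)
```

In this model, if `home.reachable` is set to `False` after a successful `write()`, the next `read()` of the same path raises `HomeUnreachable`, because the local copy was discarded and has not yet been re-cached. When the home cluster is reachable, the read succeeds but increments `fetches`, which corresponds to the re-caching cost noted above.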