1. Field of Invention
The present invention relates generally to the field of distributed file systems. More specifically, the present invention is related to a distributed file system supporting multiple locking protocols wherein one of the protocols is specifically suited for wide area data replication.
2. Discussion of Relevant Art
Distributed file systems have become the principal method for sharing data in distributed applications. Programmers understand local file system semantics well and use them to gain access to shared data easily. For exactly the same reason that distributed file systems are easy to use, they are difficult to implement. The distributed file system takes responsibility for providing synchronized access and consistent views of shared data, shielding the application and programmer from these tasks, by moving the complexity into the file system.
The file system responsibilities include data consistency and cache coherency. In a file system, the data consistency guarantee describes the interaction between distributed clients that concurrently read and write the same data. Most often, file systems implement sequential consistency—“a multi-processor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program,” see L. Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs,” IEEE Transactions on Computers, C-28(9):241-248, 1979. Sequential consistency is the strongest consistency guarantee. However, some applications benefit from weakening the consistency guarantee in order to achieve better performance.
Cache coherency is the task of making sure that the data cached by a client does not violate the consistency guarantee and generally requires a locking protocol. For cache coherency protocols, the semantics of the locks used to implement the protocol define how data is managed in the system. However, in general, cache coherency locking is designed the other way around. Developers of a distributed data system choose the data model appropriate to their expected workload, and implement a set of locks that conform to that model.
For many distributed applications, the consistency guarantee and caching provided by the file system is more than adequate. These applications write to the file system interface and ignore all the complexity of managing distributed data.
Other distributed applications are not compatible with the file system consistency and coherency model, but would still like to take advantage of other file system services. A file system provides many things beyond consistency and caching, including: a name space, through which objects can be accessed; data management functions, such as backup and restore and hierarchical storage management; data reliability and availability; striping and RAID; distributed logical volume support, and mapping logical addresses to physical devices.
Locking systems supporting file system caching are appropriate for applications which do not perform their own caching. However, to optimize performance, availability and recovery, databases and other distributed applications manage their own data caches. Applications have knowledge of the structure of the data on which they operate and the semantics with which the data are changed. Consequently, applications can cache data at a finer granularity than a file system can, and therefore achieve more parallelism. They can also customize their distributed application's caching properties, e.g., implement a cooperative cache or a different consistency guarantee.
Implementing these applications on file systems that cache data results in data being cached twice. Also, file systems that cache written data prevent application caches from realizing as much parallelism as possible and reduce performance.
In order to illustrate the manner in which caching at the file system limits parallelism, assume an application that caches its own data at a fine granularity, with a cache unit of 4 KB, constructed on a file system that caches data in larger segments. Now, if two clients concurrently write separate application cache units, they expect the writes to occur in parallel. However, because both writes lie within the same file system cache segment, the concurrent writes are executed serially. That is, the first client to write to the file system obtains a write lock for that segment, and writes the data to the file system's cache. When the second client attempts to write, the write lock of the first client is revoked. The first client commits its changes to disk. The second client must then read the segment from disk before applying its write to the segment. Because the file system segment is larger than the application cache unit, writes often need to be serialized.
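The serialization described above can be sketched as follows. This is a minimal illustration only: the 64 KB segment size and the function names are assumptions introduced here, since the text specifies only the 4 KB application cache unit and states that the file system segment is larger.

```python
APP_UNIT = 4 * 1024       # application cache unit from the example (4 KB)
FS_SEGMENT = 64 * 1024    # hypothetical file system cache segment size (assumption)


def fs_segment(offset, segment_size=FS_SEGMENT):
    """Return the index of the file system cache segment that contains offset."""
    return offset // segment_size


def writes_serialize(unit_a, unit_b, unit_size=APP_UNIT, segment_size=FS_SEGMENT):
    """Return True if writes to two application cache units contend for the
    same file system segment write lock and therefore execute serially."""
    if unit_a == unit_b:
        return True  # same unit: a genuine conflict at the application level
    seg_a = fs_segment(unit_a * unit_size, segment_size)
    seg_b = fs_segment(unit_b * unit_size, segment_size)
    return seg_a == seg_b
```

Under these assumed sizes, units 0 and 1 fall in the same segment and are serialized even though the application regards them as independent, whereas units 0 and 16 fall in different segments and can proceed in parallel.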
Many databases and parallel I/O applications use either a physical device or logical volume interface to avoid caching and obtain good parallel data performance. While device or volume access solves performance problems, consumers often prefer to build databases in file systems for manageability reasons. The advantages of a file system include name space, integrated backup, and extensibility. File systems give a hierarchically organized name to database objects, which is easier to understand than volume or device names. Data stored in file systems can be copied, moved, or backed up with file system tools, and databases offload this function to the file system. Finally, files make databases easy to extend and do not require an administrator for this task.
As described previously, some applications benefit from weakening the consistency guarantee. An area which can benefit from weakening the consistency guarantee is data replication. For reasons of manageability and ease of implementation, and despite the distributed file system's performance shortcomings, enterprise customers often choose to use file systems to replicate data among many web servers. Distributed file systems offer the web developer a well known interface that appears identical to a local file system. The interface shields the complexity of data distribution and consistency issues from the application. Web applications are written for a single server, and then deployed with no modifications on many computers. The deployment leverages the enterprise's existing distributed file system, and no new distributed systems need to be installed or administered.
The limitation of using a file system for wide area replication of web data is performance. Generally, file systems implement data locks that provide strong consistency between readers and writers. For replicating data from one writer to multiple readers, strong consistency produces lock contention—a flurry of network messages to obtain and revoke locks. Contention loads the network and results in slow application progress.
The posting of Olympics race results provides an example application to show how strong consistency locking works for replicating web servers. A possible configuration of the distributed system is illustrated in FIG. 1. Hundreds of web servers 104 have file system clients on the back end which read the web pages from a file system cache 102. Additionally, a database management system (DBMS) 106 reads and writes from a file system cache 102. The file system caches 102 communicate with a file system server 100. Race results and other dynamic data are inserted into the database 1 through an interface like SQL or ODBC. The insertion updates database tables, which sets off a database trigger 2. The trigger causes the database to author a new web page and perform a write operation 3. The file system is responsible for distributing the new version of the web page to the remainder of the web servers 104 in a consistent fashion. The I/Os are performed out of file system caches, which the locking protocol keeps consistent. Web servers take HTTP requests, which result in data reads from the file system.
Poor performance occurs for a strong consistency locking protocol when updates occur to active files. For this example, we assume that the new file that is being updated is being read concurrently by multiple web servers before, during, and after new results are being posted and written. For race results during the Olympics, a good example would be a page that contains a summary of how many gold, silver and bronze medals have been won by each country. This page has wide interest and changes often. The system's initial configuration has the file system clients at all web servers holding a shared lock for the file in question. FIG. 2a displays the locking messages required to update a web page in a timing diagram.
The best possible performance of strong consistency locking occurs when the file system client at the database is not interrupted while it updates the web page. In this case, the writing client requests an exclusive lock on the file 200, which revokes all concurrently held read locks 202, 204. After revocation, the writer, in this case the DBMS, is granted an exclusive lock 206. When the writer completes, the client at each web server must request and be granted a shared lock on the data 208, 210, 212, 214 to reobtain the privilege to read and serve the web page. All messages to and from the web server occur for every web server. In the best case, four messages, REVOKE, RELEASE, ACQUIRE, and GRANT, go between each web server and the file system server. For large installations, multiple messages between web servers and the file server consume time and network bandwidth prohibitively.
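The best-case message cost described above can be worked through with a short calculation. The sketch below counts only the four messages exchanged with each web server, as named in the text; the function name and the example server count are assumptions for illustration, and the writer's own lock request and grant are deliberately left out.

```python
# The four messages exchanged between each web server and the file system
# server in the best case, as named in the text.
MESSAGES_PER_WEB_SERVER = ("REVOKE", "RELEASE", "ACQUIRE", "GRANT")


def strong_consistency_messages(num_web_servers):
    """Best-case count of locking messages to and from web servers for one
    page update (excludes the writer's own request and grant)."""
    return len(MESSAGES_PER_WEB_SERVER) * num_web_servers
```

For a hypothetical installation of 500 web servers, a single update of one page would require at least 2000 locking messages, and the writer cannot begin until every revocation has completed.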
When preemptible locks are implemented in the locking system, additional difficulties are incurred during an update by the DBMS, particularly when the DBMS is interrupted in the middle of updating the file. Data locks are preemptible, so clients that request read locks on the file revoke the DBMS's exclusive lock. The more highly replicated the data is, the more web servers there are, and the more likely the DBMS is to be interrupted. Requests for the lock can stall the update indefinitely. Furthermore, if the DBMS is interrupted in the middle of writing the new file, the update is incomplete, and the data that a reading client (web server) sees is not a parsable HTML document.
The example application presents a nonstandard workload to the file system. The workload lacks the properties a file system expects, and therefore operates inefficiently. For example, the workload does not have spatial locality to clients. Instead, all clients are interested in all files.
Performance concerns aside, strong consistency is the wrong model for updating web pages. Reading clients cannot understand intermediate updates and byte-level changes.
Outside the intermediate update problem, strong consistency is still not useful for serving the HTTP protocol. Since the web client/server protocol (HTTP) does not implement consistency between the browser cache and the server, a strong consistency guarantee at the server cannot be passed on to the ultimate consumers of the data at their browsers. Therefore, a weaker consistency model looks identical to web users and can be more efficiently implemented.
Among existing distributed systems, the Andrew file system (AFS) provides weak consistency and comes close to implementing an efficient model for such data replication. AFS does not implement strong consistency; rather, it chooses to synchronize file data between readers and writers when files opened for write are closed, see M. L. Kazar, “Synchronization and caching issues in the Andrew file system,” USENIX Winter Technical Conference, February 1988. AFS cites statistics that argue that data are shared infrequently to justify this decision. The successor product to Andrew, the Decorum file system, see Kazar et al., “Decorum file system architectural overview,” Proceedings of the Summer USENIX Conference, June 1990, currently known as DFS, reversed this decision because its designers found that distributed applications sometimes require sequential consistency for correctness. While strong consistency is now the de facto standard for file systems, AFS-style consistency remains useful for environments where concurrent action and low overhead dominate the need for correctness.
FIG. 2b shows the protocol used in AFS to handle the dynamic web updates in the example. At the start of the timing diagram, all web servers hold the web page (file) in question open for read. In AFS, an open instance 220 registers a client for callbacks, i.e., messages from the server invalidating the client's cached version. The DBMS opens the file for writing 220, receives the data from the server 222, writes the data to its cache, and closes the file. On close, the DBMS writes its cached copy of the file back to the server 224. The file system server notifies the web servers of the changes to the file by sending invalidation messages to all file system clients that have registered for a callback 226. The clients then open 228 and read 230 the updated data.
When compared to the protocol for strong consistency locking, AFS saves only one message between client and server. However, this belies the performance difference. In AFS, all protocol operations are asynchronous—file system clients never wait for lock revocations before proceeding. AFS eliminates lock contention and eliminates waiting for the locking protocol. The DBMS obtains a write instance from the server directly, and need not wait for a synchronous revocation call to all clients. Another significant advantage of AFS is that the old version of the web page is continuously available at the web servers while the file is being updated at the DBMS.
The disadvantage of AFS is that it does not correctly implement an efficient model for data replication. The actual behavior is that the AFS clients write dirty data back to the server when closing a file, and AFS servers send callback invalidation messages whenever clients write data. In most cases, these policies result in an appropriate consistency. However, if a writing client writes back some portion of its cache without closing the file, a callback is sent to all registered clients, and reading clients can see partial updates. This most often occurs when a writing client, the DBMS in our example, operates under a heavy load or on large files. In these cases, the cache manager writes back dirty blocks to the server to reclaim space.
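The partial update hazard described above can be illustrated with a toy model. The sketch below is not AFS code: the class names, the block-level representation of a file, and the demo contents are all assumptions made for illustration. It models only the two behaviors stated in the text, namely that the server sends callbacks whenever a client writes data back, and that a reader re-opens and re-reads the file after a callback.

```python
class AfsServerSketch:
    """Toy server: stores file blocks and breaks callbacks on every write-back."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.callbacks = []  # readers registered by opening the file

    def open_for_read(self, reader):
        """Register the reader for a callback and hand it the current blocks."""
        self.callbacks.append(reader)
        reader.cache = list(self.blocks)

    def write_back(self, dirty):
        """Store dirty blocks {index: data}, then invalidate registered readers."""
        for index, data in dirty.items():
            self.blocks[index] = data
        registered, self.callbacks = self.callbacks, []
        for reader in registered:
            reader.invalidate(self)


class ReaderSketch:
    """Toy web server client that re-opens and re-reads on invalidation."""

    def __init__(self):
        self.cache = None
        self.versions_seen = []

    def invalidate(self, server):
        server.open_for_read(self)
        self.versions_seen.append("".join(self.cache))


def demo():
    server = AfsServerSketch(["<old-a>", "<old-b>"])
    reader = ReaderSketch()
    server.open_for_read(reader)
    # Cache pressure at the writer: one dirty block is flushed before close.
    server.write_back({0: "<new-a>"})
    # Close: the remaining dirty block is written back.
    server.write_back({1: "<new-b>"})
    return reader.versions_seen
```

In this model, the first version the reader observes mixes a new block with an old one, which corresponds to the partial update a web server could serve. Only the second version, observed after the write-back on close, is the complete new page.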
With the above in mind it is apparent there exists a need for a distributed storage management system which is useful to a variety of applications. Furthermore, there is a need for a locking protocol which is designed for efficient data replication.