Most complex business applications are run not on a single computer system, but in a distributed system in which multiple computer systems, referred to as nodes, each contribute processing resources and perform different tasks. Some types of distributed applications involve sharing a single file among multiple nodes, where more than one node can update the shared file. Examples of such distributed applications include seismic data processing, imaging, and scientific computation.
When a file is shared, more than one program may wish to read data from and/or write data to the shared file at the same point in time. Concurrent read operations are commonly supported by most applications and file systems, but concurrent write operations require more sophisticated management techniques and are less commonly supported. For example, assume that one writer program writes AA to a given location in a file, and another writer program writes BB to the same location in the file at the same time. After both write commands are completed, a reader program should read either AA or BB, and not AB, from that location in the file, depending upon the order in which the write operations complete. This functionality follows the Portable Operating System Interface (POSIX) standard, which defines the language interface between application programs and the Unix operating system. Adherence to the POSIX standard ensures compatibility when programs are moved from one Unix computer to another, a highly desirable feature. For example, the standard “read” and “write” I/O commands are expected to operate as described above when performed by any application program on any POSIX-compliant operating system.
To ensure POSIX compliance, most file systems require that only one program write to a file at a time, and that no program read the file while data are being written to it. However, meeting these requirements entails serializing write commands and delaying read commands, thereby degrading performance significantly, particularly with very large shared files.
FIG. 1A shows sequential execution of write operations on a file. Input/output commands are being executed on file 102 by processes 110A, 110B, 110C, and 110D. File 102 is shown as including eight portions (labeled 0 through 7). While a single character is shown to represent the data contained in each portion of file 102, one of skill in the art will understand that the data are for illustration purposes and represent portions of the file containing one or more bytes. One of skill in the art will also understand that the term “portion” is used in a general sense to indicate the units in which file 102 is read, whereas files are often described as being read as blocks or regions of data, where multiple blocks occur within a given region. No particular unit of measure is intended by use of the term “portion” herein.
In FIG. 1A, process 110A writes data to portion 1 (P1) of file 102, and process 110B writes data to portions 5 and 6 (P5 and P6). Process 110C reads data from portions 4 through 7 of file 102, and process 110D extends the size of file 102 to ten portions. Processes 110B and 110C can be said to request “conflicting operations,” because the portions targeted by the operations overlap and at least one of the operations is a write operation. The requirement that one of the operations is a write operation takes into account that multiple simultaneous read operations are allowed by most file systems, even if the portions targeted by each read operation overlap.
In a typical prior art file system, using standard “read” and “write” commands, only one of processes 110A, 110B, and 110D can write to file 102 at one time. Furthermore, process 110C cannot read from file 102 while any of processes 110A, 110B, and 110D is writing data to file 102. Also, using standard file system interfaces, no I/O commands can be executed during an operation that changes the size of a file or allocates new space to a file. As a result, each of processes 110A, 110B, and 110D is executed sequentially; in the example shown, process 110A executes first at time t1 and process 110B writes data to file 102 second at time t2. Finally, process 110D extends the size of file 102 at time t3. One of skill in the art will recognize that a different sequential ordering may be followed, depending upon the order in which processes 110A, 110B, and 110D initiate the respective operations. Process 110C can read from file 102 only at a point in time when none of processes 110A, 110B, and 110D is writing to file 102.
Several attempts to enable concurrent input and output to a file have been made. Some file systems provide specific application program interfaces (APIs) that allow concurrent programs to perform I/O operations on a single file. These file systems typically rely upon application programs to use these file system-specific interfaces to synchronize conflicting input/output commands so that a consistent view of the file is seen by multiple readers and writers. Rather than use standard file system-independent, POSIX-compliant I/O commands, special commands or interfaces are used to perform I/O to files, thereby requiring application programs to be changed for different file systems.
For example, Veritas Software Corporation's File System (VxFS) provides commands and interfaces including qio (“quick I/O”) and Oracle Corporation provides Oracle Disk Manager for Oracle database files. Specific APIs include a special file system-specific open flag for opening a file for concurrent I/O and a special file system-specific mount option to mount the file system to enable concurrent I/O. However, these options cannot be used for all I/O operations on all types of files. Depending upon the specific interface used, some APIs can be used only for direct I/O or within a special namespace. Other APIs require applications to perform locking. In some file systems, if an application does not properly synchronize I/O commands, inconsistent data may be produced with overlapping write commands (such as the AB result described in the scenario above).
In some Unix file systems, a “range locking” facility is provided to enable file and record locking. This facility can be used by cooperating concurrent processes to access regions in a file in a consistent fashion. A file region can be locked for shared or exclusive access, and the lock serves as a mechanism to prevent more than one process from writing to the same file region at once. However, this type of range locking controls cooperation between processes with regard to writing data to the same region of a file, but does not affect whether I/O operations can execute concurrently on different regions of the same file. The file system itself serializes I/O operations on different regions of the same file.
When more than one node performs I/O operations to the same file, each node may operate with its own cache rather than writing directly to the file itself. In some clustered environments, the portions of a file that have been written by each node are tracked in a table or bitmap at each respective node. A write operation produces a “dirty page” in the cache on the node performing the write operation. Dirty pages may be tracked by setting a bit in a bitmap. When a second node requests to perform I/O operations on a portion of a file that has been written by the first node (corresponding to the dirty pages in the first node's cache), the node holding a lock for the file “flushes” the updated value from cache to disk before relinquishing the lock. In some implementations, the cached value for the portion of the file just written is also invalidated. In other implementations, a bit may be set indicating that the value in that portion of the cache is no longer dirty, indicating that the value in that location of the cache can be overwritten.
In prior art systems, using one lock for the entire file required that the entire cache of dirty pages be written to disk. If node A has dirty pages corresponding to the data in the first half of the file and another node B reads from the second half of the file, the dirty pages on node A will be unnecessarily flushed to disk, since node B will not be reading that data. Similarly, if node A has dirty pages corresponding to the first half of the file and another node B writes to the second half of the file, the dirty pages on node A will be written to disk and the page cache on node A invalidated, even though neither operation is necessary because node B is operating on a different region of the file. Furthermore, any bitmap corresponding to regions changed in the file will be invalidated.
What is needed is a way to efficiently coordinate caching operations between nodes operating on the same file while allowing different regions of the file to be written concurrently.